Abstract

We  consider  the  problem  of  decentralized  detection  under  constraints  on  the  number  of  bits  that 
can  be  transmitted  by  each  sensor.  In  contrast  to  most  previous  work,  in  which  the  joint  distribution 
of  sensor  observations  is  assumed  to  be  known,  we  address  the  problem  when  only  a  set  of  empirical 
samples  is  available.  We  propose  a  novel  algorithm  using  the  framework  of  empirical  risk  minimization 
and  marginalized  kernels,  and  analyze  its  computational  and  statistical  properties  both  theoretically  and 
empirically.  We  provide  an  efficient  implementation  of  the  algorithm,  and  demonstrate  its  performance 
on  both  simulated  and  real  data  sets. 


1  Introduction 

A decentralized detection system typically involves a set of sensors that receive observations from the
environment, but are permitted to transmit only a summary message (as opposed to the full observation) back to
a  fusion  center.  On  the  basis  of  its  received  messages,  this  fusion  center  then  chooses  a  final  decision  from 
some  number  of  alternative  hypotheses  about  the  environment.  The  problem  of  decentralized  detection  is  to 
design  the  local  decision  rules  at  each  sensor,  which  determine  the  messages  that  are  relayed  to  the  fusion 
center, as well as a decision rule for the fusion center itself [28]. A key aspect of the problem is the presence
of communication constraints, meaning that the sizes of the messages sent by the sensors back to the fusion
center must be suitably “small” relative to the raw observations, whether measured in terms of bits or
power.  The  decentralized  nature  of  the  system  is  to  be  contrasted  with  a  centralized  system,  in  which  the 
fusion  center  has  access  to  the  full  collection  of  raw  observations. 

Such  problems  of  decentralized  decision-making  have  been  the  focus  of  considerable  research  in  the  past 
two decades [e.g., 27, 28, 7, 8]. Indeed, decentralized systems arise in a variety of important applications,
ranging from sensor networks, in which each sensor operates under severe power or bandwidth constraints,
to the modeling of human decision-making, in which high-level executive decisions are frequently based
on lower-level summaries. The large majority of the literature is based on the assumption that the probability
distributions of the sensor observations lie within some known parametric family (e.g., Gaussian and
conditionally independent), and seeks to characterize the structure of optimal decision rules. The probability
of  error  is  the  most  common  performance  criterion,  but  there  has  also  been  a  significant  amount  of  work 
devoted  to  other  criteria,  such  as  the  Neyman-Pearson  or  minimax  formulations.  See  Tsitsiklis  [28]  and 
Blum  et  al.  [7]  for  comprehensive  surveys  of  the  literature. 

More concretely, let $Y \in \{-1, +1\}$ be a random variable, representing the two possible hypotheses
in a binary hypothesis-testing problem. Moreover, suppose that the system consists of $S$ sensors, each
of which observes a single component of the $S$-dimensional vector $X = (X^1, \ldots, X^S)$. One starting
point is to assume that the joint distribution $P(X, Y)$ falls within some parametric family. Of course, such
an assumption raises the modeling issue of how to determine an appropriate parametric family, and how to
estimate parameters. Both of these problems are very challenging in contexts such as sensor networks, given
highly inhomogeneous distributions and a large number $S$ of sensors. Our focus in this paper is on relaxing
this assumption, and developing a method in which no assumption about the joint distribution $P(X, Y)$ is
required. Instead, we posit that a number of empirical samples $(x_i, y_i)_{i=1}^n$ are given.

In the context of centralized signal detection problems, there is an extensive line of research on nonparametric
techniques, in which no specific parametric form for the joint distribution $P(X, Y)$ is assumed
(see, e.g., Kassam [19] for a survey). In the decentralized setting, however, it is only relatively recently that
nonparametric methods for detection have been explored. Several authors have taken classical nonparametric
methods from the centralized setting, and shown how they can also be applied in a decentralized system.
Such methods include schemes based on the Wilcoxon signed-rank test statistic [33, 23], as well as the sign
detector  and  its  extensions  [13,  1,  15].  These  methods  have  been  shown  to  be  quite  effective  for  certain 
types  of  joint  distributions. 

Our approach to decentralized detection in this paper is based on a combination of ideas from reproducing-kernel
Hilbert spaces [2, 25], and the framework of empirical risk minimization from nonparametric statistics.
Methods based on reproducing-kernel Hilbert spaces (RKHSs) have figured prominently in the literature
on centralized signal detection and estimation for several decades [e.g., 34, 17, 18]. More recent work
in  statistical  machine  learning  [e.g.,  26]  has  demonstrated  the  power  and  versatility  of  kernel  methods  for 
solving  classification  or  regression  problems  on  the  basis  of  empirical  data  samples.  Roughly  speaking, 
kernel-based  algorithms  in  statistical  machine  learning  involve  choosing  a  function,  which  though  linear 
in  the  RKHS,  induces  a  nonlinear  function  in  the  original  space  of  observations.  A  key  idea  is  to  base 
the  choice  of  this  function  on  the  minimization  of  a  regularized  empirical  risk  functional.  This  functional 
consists of the empirical expectation of a convex loss function $\phi$, which represents an upper bound on the
0-1 loss (the 0-1 loss corresponds to the probability of error criterion), combined with a regularization term
that restricts the optimization to a convex subset of the RKHS. It has been shown that suitable choices of
margin-based convex loss functions lead to algorithms that are robust both computationally [26] and
statistically [35, 3]. The use of kernels in such empirical loss functions greatly increases their flexibility, so
that  they  can  adapt  to  a  wide  range  of  underlying  joint  distributions. 

In this paper, we show how kernel-based methods and empirical risk minimization are naturally suited
to the decentralized detection problem. More specifically, a key component of the methodology that we
propose involves the notion of a marginalized kernel, where the marginalization is induced by the transformation
from the observations $X$ to the local decisions $Z$. The decision rules at each sensor, which
can be either probabilistic or deterministic, are defined by conditional probability distributions of the form
$Q(Z|X)$, while the decision at the fusion center is defined in terms of $Q(Z|X)$ and a linear function over
the corresponding RKHS. We develop and analyze an algorithm for optimizing the design of these decision
rules.  It  is  interesting  to  note  that  this  algorithm  is  similar  in  spirit  to  a  suite  of  locally  optimum  detectors 
in  the  literature  [e.g.,  7],  in  the  sense  that  one  step  consists  of  optimizing  the  decision  rule  at  a  given  sensor 
while  fixing  the  decision  rules  of  the  rest,  whereas  another  step  involves  optimizing  the  decision  rule  of  the 
fusion  center  while  holding  fixed  the  local  decision  rules  at  each  sensor.  Our  development  relies  heavily  on 
the convexity of the loss function $\phi$, which allows us to leverage results from convex analysis [24] so as to
derive  an  efficient  optimization  procedure.  In  addition,  we  analyze  the  statistical  properties  of  our  algorithm, 
and  provide  probabilistic  bounds  on  its  performance. 

While the thrust of this paper is to explore the utility of recently-developed ideas from statistical machine
learning for distributed decision-making, our results also have implications for machine learning. In
particular, it is worth noting that most of the machine learning literature on classification is abstracted away
from considerations of an underlying communication-theoretic infrastructure. The limitations of such an
infrastructure may prevent an algorithm from aggregating all relevant data at a central site. Therefore, the
general approach described
in  this  paper  suggests  interesting  research  directions  for  machine  learning — specifically,  in  designing  and 
analyzing  algorithms  for  communication-constrained  environments. 

The  remainder  of  the  paper  is  organized  as  follows.  In  Section  2,  we  provide  a  formal  statement  of  the 
decentralized  decision-making  problem,  and  show  how  it  can  be  cast  as  a  learning  problem.  In  Section  3,  we 
present  a  kernel-based  algorithm  for  solving  the  problem,  and  we  also  derive  bounds  on  the  performance  of 
this  algorithm.  Section  4  is  devoted  to  the  results  of  experiments  using  our  algorithm,  in  application  to  both 
simulated  and  real  data.  Finally,  we  conclude  the  paper  with  a  discussion  of  future  directions  in  Section  5. 

2  Problem  formulation  and  a  simple  strategy 

In  this  section,  we  begin  by  providing  a  precise  formulation  of  the  decentralized  detection  problem  to  be 
investigated  in  this  paper,  and  show  how  it  can  be  formulated  in  terms  of  statistical  learning.  We  then 
describe  a  simple  strategy  for  designing  local  decision  rules,  based  on  an  optimization  problem  involving 
the  empirical  risk.  This  strategy,  though  naive,  provides  intuition  for  our  subsequent  development  based  on 
kernel  methods. 

2.1  Formulation  of  the  decentralized  detection  problem 

Suppose $Y$ is a discrete-valued random variable, representing a hypothesis about the environment. Although
the methods that we describe are more generally applicable, the focus of this paper is the binary case, in
which the hypothesis variable $Y$ takes values in $\mathcal{Y} := \{-1, +1\}$. Our goal is to form an estimate $\hat{Y}$
of the true hypothesis, based on observations collected from a set of $S$ sensors. More specifically, for each
$t = 1, \ldots, S$, let $X^t \in \mathcal{X}$ represent the observation at sensor $t$, where $\mathcal{X}$ denotes the observation space. The
full set of observations corresponds to the $S$-dimensional random vector $X = (X^1, \ldots, X^S) \in \mathcal{X}^S$, drawn
from the conditional distribution $P(X|Y)$.

We assume that the global estimate $\hat{Y}$ is to be formed by a fusion center. In the centralized setting, this
fusion center is permitted access to the full vector $X = (X^1, \ldots, X^S)$ of observations. In this case, it is
well-known [31] that optimal decision rules, whether under the Bayes error or the Neyman-Pearson criteria,
can be formulated in terms of the likelihood ratio $P(X|Y = 1) / P(X|Y = -1)$. In contrast, the defining
feature of the decentralized setting is that the fusion center has access only to some form of summary of each
observation $X^t$, $t = 1, \ldots, S$. More specifically, we suppose that each sensor $t = 1, \ldots, S$ is permitted
to transmit a message $Z^t$, taking values in some space $\mathcal{Z}$. The fusion center, in turn, applies some decision
rule $\gamma$ to compute an estimate $\hat{Y} = \gamma(Z^1, \ldots, Z^S)$ of $Y$ based on its received messages.

In this paper, we focus on the case of a discrete observation space, say $\mathcal{X} = \{1, 2, \ldots, M\}$. The
key constraint, giving rise to the decentralized nature of the problem, is that the corresponding message
space $\mathcal{Z} = \{1, \ldots, L\}$ is considerably smaller than the observation space (i.e., $L \ll M$). The problem is
to find, for each sensor $t = 1, \ldots, S$, a decision rule $\gamma^t : \mathcal{X}^t \to \mathcal{Z}^t$, as well as an overall decision rule
$\gamma : \mathcal{Z}^S \to \{-1, +1\}$ at the fusion center so as to minimize the Bayes risk $P(Y \neq \gamma(Z))$. We assume that
the joint distribution $P(X, Y)$ is unknown, but that we are given $n$ independent and identically distributed
(i.i.d.) data points $(x_i, y_i)_{i=1}^n$ sampled from $P(X, Y)$.


Figure 1. Decentralized detection system with $S$ sensors, in which $Y$ is the unknown hypothesis,
$X = (X^1, \ldots, X^S)$ is the vector of sensor observations, and $Z = (Z^1, \ldots, Z^S)$ are the quantized messages
transmitted from sensors to the fusion center.


Figure 1 provides a graphical representation of this decentralized detection problem. The single node at
the top of the figure represents the hypothesis variable $Y$, and the outgoing arrows point to the collection of
observations $X = (X^1, \ldots, X^S)$. The local decision rules $\gamma^t$ lie on the edges between sensor observations
$X^t$ and messages $Z^t$. Finally, the node at the bottom is the fusion center, which collects all the messages.

Although the Bayes-optimal risk can always be achieved by a deterministic decision rule [28], considering
the larger space of stochastic decision rules confers some important advantages. First, such a space
can be compactly represented and parameterized, and prior knowledge can be incorporated. Second, the optimal
deterministic rules are often very hard to compute, and a probabilistic rule may provide a reasonable
approximation in practice. Accordingly, we represent the rule for the sensors $t = 1, \ldots, S$ by a conditional
probability distribution $Q(Z|X)$. The fusion center makes its decision by applying a deterministic function
$\gamma(z)$ of $z$. The overall decision rule $(Q, \gamma)$ consists of the individual sensor rules and the fusion center rule.

The decentralization requirement for our detection/classification system (i.e., that the decision rule for
sensor $t$ must be a function only of the observation $x^t$) can be translated into the probabilistic statement
that $Z^1, \ldots, Z^S$ be conditionally independent given $X$:

$$Q(Z|X) = \prod_{t=1}^{S} Q^t(Z^t|X^t). \tag{1}$$

In fact, this constraint turns out to be advantageous from a computational perspective, as will be clarified
in the sequel. We use $\mathcal{Q}$ to denote the space of all factorized conditional distributions $Q(Z|X)$, and $\mathcal{Q}_0$ to
denote the subset of factorized conditional distributions that are also deterministic.


2.2  A  simple  strategy  based  on  minimizing  empirical  risk 

Suppose that we have as our training data $n$ pairs $(x_i, y_i)$ for $i = 1, \ldots, n$. Note that each $x_i$, as a particular
realization of the random vector $X$, is an $S$-dimensional signal vector $x_i = (x_i^1, \ldots, x_i^S) \in \mathcal{X}^S$. Let $P$
be the unknown underlying probability distribution for $(X, Y)$. The probabilistic set-up makes it simple to
estimate the Bayes risk, which is to be minimized.

Consider a collection of local decision rules made at the sensors, which we denote by $Q(Z|X)$. For
each such set of rules, the associated Bayes risk is defined by:

$$R_{opt} := \frac{1}{2} - \frac{1}{2} E\,\big| P(Y = 1|Z) - P(Y = -1|Z) \big|. \tag{2}$$


Here the expectation $E$ is with respect to the probability distribution $P(X, Y, Z) := P(X, Y)\, Q(Z|X)$. It
is clear that no decision rule at the fusion center (i.e., having access only to $z$) has Bayes risk smaller than
$R_{opt}$. In addition, the Bayes risk $R_{opt}$ can be achieved by using the decision function

$$\gamma_{opt}(z) = \mathrm{sign}\big(P(Y = 1|z) - P(Y = -1|z)\big).$$


It is key to observe that this optimal decision rule cannot be computed, because $P(X, Y)$ is not known, and
$Q(Z|X)$ is to be determined. Thus, our goal is to determine the rule $Q(Z|X)$ that minimizes an empirical
estimate of the Bayes risk based on the training data $(x_i, y_i)_{i=1}^n$. In Lemma 1 we show that the following is
one such unbiased estimate of the Bayes risk:

$$R_{emp} := \frac{1}{2} - \frac{1}{2n} \sum_{z} \Big| \sum_{i=1}^{n} Q(z|x_i)\, y_i \Big|. \tag{3}$$


In addition, $\gamma_{opt}(z)$ can be estimated by the decision function $\gamma_{emp}(z) = \mathrm{sign}\big(\sum_{i=1}^{n} Q(z|x_i)\, y_i\big)$. Since $Z$
is a discrete random vector, the optimal Bayes risk can be estimated easily, regardless of whether the input
signal $X$ is discrete or continuous.
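The estimate (3) and the plug-in rule $\gamma_{emp}$ can be sketched in a few lines. This is only an illustration under assumed conventions that are not part of the text: observations and messages are 0-indexed, and the hypothetical nested list `Q[t][x][z]` stores $Q^t(z|x)$ for sensor $t$. The sum over $z$ is carried out by brute force, so the cost grows as $L^S$.

```python
import itertools


def q_joint(Q, z, x):
    """Factorized rule (1): Q(z | x) = prod_t Q^t(z^t | x^t)."""
    p = 1.0
    for t, table in enumerate(Q):
        p *= table[x[t]][z[t]]
    return p


def class_scores(Q, xs, ys, L):
    """Map each message vector z to sum_i Q(z | x_i) * y_i."""
    S = len(Q)
    return {
        z: sum(q_joint(Q, z, x) * y for x, y in zip(xs, ys))
        for z in itertools.product(range(L), repeat=S)
    }


def empirical_risk(Q, xs, ys, L):
    """R_emp of equation (3); enumerates all L**S message vectors."""
    scores = class_scores(Q, xs, ys, L)
    return 0.5 - sum(abs(s) for s in scores.values()) / (2 * len(xs))


def gamma_emp(Q, xs, ys, L):
    """Plug-in fusion rule: sign of each score (ties broken toward +1)."""
    return {z: (1 if s >= 0 else -1) for z, s in class_scores(Q, xs, ys, L).items()}


# One sensor, M = 2 observation values, L = 2 messages (toy data).
identity_rule = [[[1.0, 0.0], [0.0, 1.0]]]   # message copies the observation
constant_rule = [[[1.0, 0.0], [1.0, 0.0]]]   # message ignores the observation
xs = [(0,), (0,), (1,), (1,)]
ys = [-1, -1, 1, 1]

print(empirical_risk(identity_rule, xs, ys, L=2))
print(empirical_risk(constant_rule, xs, ys, L=2))
print(gamma_emp(identity_rule, xs, ys, L=2))
```

On this toy data the informative rule attains zero empirical risk, while the rule that always sends the same message is stuck at 1/2, the risk of guessing.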


Lemma 1. (a) Assume that $P(z) > 0$ for all $z$. Define

$$\kappa(z) := \frac{\sum_{i=1}^{n} Q(z|x_i)\, I(y_i = 1)}{\sum_{i=1}^{n} Q(z|x_i)}.$$

Then $\lim_{n \to \infty} \kappa(z) = P(Y = 1|z)$.

(b) As $n \to \infty$, $R_{emp}$ and $\gamma_{emp}(z)$ tend to $R_{opt}$ and $\gamma_{opt}(z)$, respectively.

Proof. See Appendix 1. □


The significance of Lemma 1 is in motivating the goal of finding decision rules $Q(Z|X)$ to minimize
the empirical error $R_{emp}$. It is equivalent, using equation (3), to maximize

$$C(Q) = \sum_{z} \Big| \sum_{i=1}^{n} Q(z|x_i)\, y_i \Big| \tag{4}$$

subject to the constraints that define a probability distribution:

$$\begin{aligned}
Q(z|x) &= \prod_{t=1}^{S} Q^t(z^t|x^t) && \text{for all values of } z \text{ and } x, \\
\sum_{z^t} Q^t(z^t|x^t) &= 1 && \text{for } t = 1, \ldots, S, \\
Q^t(z^t|x^t) &\in [0, 1] && \text{for } t = 1, \ldots, S.
\end{aligned} \tag{5}$$


The major computational difficulty in the optimization problem defined by equations (4) and (5) lies in the
summation over all $L^S$ possible values of $z \in \mathcal{Z}^S$. One way to avoid this obstacle is by maximizing instead
the following function:

$$C_2(Q) := \sum_{z} \Big( \sum_{i=1}^{n} Q(z|x_i)\, y_i \Big)^2.$$

Expanding the square and using the conditional independence condition (1) leads to the following equivalent
form for $C_2$:

$$C_2(Q) = \sum_{i,j} y_i y_j \prod_{t=1}^{S} \sum_{z^t=1}^{L} Q^t(z^t|x_i^t)\, Q^t(z^t|x_j^t). \tag{6}$$

Note that the conditional independence condition (1) on $Q$ allows us to compute each $(i, j)$ term of $C_2(Q)$ in
$O(SL)$ time, as opposed to $O(L^S)$.
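The equivalence between the definition of $C_2$ and the factorized form (6) can be checked numerically. The sketch below assumes the same hypothetical storage convention as before, `Q[t][x][z]` for $Q^t(z|x)$ with 0-indexed values; the rule and data are made up for illustration.

```python
import itertools


def c2_bruteforce(Q, xs, ys, L):
    """C_2(Q) from its definition: a sum over all L**S message vectors."""
    S = len(Q)
    total = 0.0
    for z in itertools.product(range(L), repeat=S):
        score = 0.0
        for x, y in zip(xs, ys):
            p = 1.0
            for t in range(S):
                p *= Q[t][x[t]][z[t]]
            score += p * y
        total += score * score
    return total


def c2_factorized(Q, xs, ys, L):
    """C_2(Q) via equation (6): each (i, j) term costs O(S * L)."""
    S = len(Q)
    total = 0.0
    for xi, yi in zip(xs, ys):
        for xj, yj in zip(xs, ys):
            prod = 1.0
            for t in range(S):
                prod *= sum(Q[t][xi[t]][l] * Q[t][xj[t]][l] for l in range(L))
            total += yi * yj * prod
    return total


# Two sensors, M = 3 observation values, L = 2 messages.
Q = [
    [[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]],
    [[0.7, 0.3], [0.4, 0.6], [0.1, 0.9]],
]
xs = [(0, 1), (2, 2), (1, 0), (2, 1)]
ys = [1, -1, 1, -1]

print(c2_bruteforce(Q, xs, ys, L=2), c2_factorized(Q, xs, ys, L=2))
```

The two values agree up to rounding; only the factorized version remains practical as $S$ grows.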

While  this  simple  strategy  is  based  directly  on  the  empirical  risk,  it  does  not  exploit  any  prior  knowledge 
about the class of discriminant functions for $\gamma(z)$. As we discuss in the following section, such knowledge
can be incorporated into the classifier using kernel methods. Moreover, the kernel-based decentralized
detection algorithm that we develop turns out to have an interesting connection to the simple approach
based on $C_2(Q)$.


3  A  kernel-based  algorithm 

In  this  section,  we  turn  to  methods  for  decentralized  detection  based  on  empirical  risk  minimization  and 
kernel methods [2, 25, 26]. We begin by introducing some background and definitions necessary for subsequent
development. We then motivate and describe a central component of our decentralized detection
system — namely,  the  notion  of  a  marginalized  kernel.  Our  method  for  designing  decision  rules  is  based  on 
an  optimization  problem,  which  we  show  how  to  solve  efficiently.  Finally,  we  derive  theoretical  bounds  on 
the  performance  of  our  decentralized  detection  system. 

3.1  Empirical  risk  minimization  and  kernel  methods 

In  this  section,  we  provide  some  background  on  empirical  risk  minimization  and  kernel  methods.  The 
exposition  given  here  is  necessarily  very  brief;  we  refer  the  reader  to  the  books  [26,  25,  34]  for  more  details. 
Our starting point is to consider estimating $Y$ with a rule of the form $\hat{y}(x) = \mathrm{sign} f(x)$, where $f : \mathcal{X} \to \mathbb{R}$ is
a discriminant function that lies within some function space to be specified. The ultimate goal is to choose
a discriminant function $f$ to minimize the Bayes error $P(Y \neq \hat{Y})$, or equivalently to minimize the expected
value of the following 0-1 loss:

$$\phi_0(y f(x)) := I[y \neq \mathrm{sign}(f(x))]. \tag{7}$$

This minimization is intractable, both because the function $\phi_0$ is not well-behaved (i.e., non-convex and
non-differentiable), and because the joint distribution $P$ is unknown. However, since we are given a set
of i.i.d. samples $\{(x_i, y_i)\}_{i=1}^n$, it is natural to consider minimizing a loss function based on an empirical
expectation, as motivated by our development in Section 2.2. Moreover, it turns out to be fruitful, for both
computational and statistical reasons, to design loss functions based on convex surrogates to the 0-1 loss.

Indeed, a variety of classification algorithms in statistical machine learning have been shown to involve
loss functions that can be viewed as convex upper bounds on the 0-1 loss. For example, the support vector
machine (SVM) algorithm [9, 26] uses a hinge loss function:

$$\phi_1(y f(x)) := (1 - y f(x))_+ = \max\{1 - y f(x), 0\}. \tag{8}$$

On the other hand, the logistic regression algorithm [12] is based on the logistic loss function:

$$\phi_2(y f(x)) := \log\big[1 + \exp(-y f(x))\big]. \tag{9}$$

Finally, the standard form of the boosting classification algorithm [11] uses an exponential loss function:

$$\phi_3(y f(x)) := \exp(-y f(x)). \tag{10}$$

Intuition suggests that a function $f$ with small $\phi$-risk $E\phi(Y f(X))$ should also have a small Bayes risk
$P(Y \neq \mathrm{sign}(f(X)))$. In fact, it has been established rigorously that convex surrogates for the (non-convex)
0-1 loss function, such as the hinge (8) and logistic loss (9) functions, have favorable properties both computationally
(i.e., algorithmic efficiency), and in a statistical sense (i.e., bounds on estimation error) [35, 3].
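The losses (7) through (10) are all functions of the margin $m = y f(x)$, which makes them easy to compare side by side. The sketch below is illustrative only; it assumes the convention that $\mathrm{sign}(0)$ counts as an error, and it uses the natural logarithm for the logistic loss. Under that choice the logistic loss equals $\log 2 < 1$ at zero margin, so it bounds the 0-1 loss from above only after rescaling by $1/\log 2$.

```python
import math


def zero_one(m):
    """0-1 loss (7) as a function of the margin m = y * f(x); m = 0 counts as an error."""
    return 1.0 if m <= 0 else 0.0


def hinge(m):
    """Hinge loss (8)."""
    return max(1.0 - m, 0.0)


def logistic(m):
    """Logistic loss (9), natural logarithm."""
    return math.log1p(math.exp(-m))


def exponential(m):
    """Exponential loss (10)."""
    return math.exp(-m)


# Grid of margins on which to compare the surrogates with the 0-1 loss.
margins = [k / 10.0 for k in range(-30, 31)]
for m in (-1.0, 0.0, 0.5, 2.0):
    print(m, zero_one(m), hinge(m), round(logistic(m), 4), round(exponential(m), 4))
```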

We now turn to consideration of the function class from which the discriminant function $f$ is to be
chosen. Kernel-based methods for discrimination entail choosing $f$ from within a function class defined by
a positive semidefinite kernel, defined as follows (see [25]):

Definition 2. A real-valued kernel function is a symmetric bilinear mapping $K_x : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$. It is
positive semidefinite, which means that for any subset $\{x_1, \ldots, x_n\}$ drawn from $\mathcal{X}$, the Gram matrix
$K_{ij} = K_x(x_i, x_j)$ is positive semidefinite.

Given any such kernel, we first define a vector space of functions mapping $\mathcal{X}$ to the real line $\mathbb{R}$ through
all sums of the form

$$f(\cdot) = \sum_{j=1}^{m} \alpha_j K_x(\cdot, x_j), \tag{11}$$

where $\{x_j\}_{j=1}^m$ are arbitrary points from $\mathcal{X}$, and $\alpha_j \in \mathbb{R}$. We can equip this space with a kernel-based inner
product by defining $\langle K_x(\cdot, x_i), K_x(\cdot, x_j) \rangle := K_x(x_i, x_j)$, and then extending this definition to the full
space by bilinearity. Note that this inner product induces, for any function of the form (11), the kernel-based
norm $\|f\|_{\mathcal{H}}^2 = \sum_{i,j=1}^{m} \alpha_i \alpha_j K_x(x_i, x_j)$.

Definition 3. The reproducing kernel Hilbert space $\mathcal{H}$ associated with a given kernel $K_x$ consists of the
kernel-based inner product, and the closure (in the kernel-based norm) of all functions of the form (11).

As an aside, the term “reproducing” stems from the fact that for any $f \in \mathcal{H}$, we have $\langle f, K_x(\cdot, x_i) \rangle = f(x_i)$,
showing that the kernel acts as the representer of evaluation [25].
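For finite expansions of the form (11), the inner product, the norm and the reproducing property reduce to sums over kernel evaluations. The sketch below illustrates this on the real line; the Gaussian kernel, the points and the coefficients are assumptions chosen for the example and do not come from the text.

```python
import math


def gauss_kernel(u, v, bandwidth=1.0):
    """Gaussian kernel on the real line (an illustrative choice of K_x)."""
    return math.exp(-((u - v) ** 2) / (2.0 * bandwidth ** 2))


def evaluate(alpha, points, x, kernel=gauss_kernel):
    """f(x) for f = sum_j alpha_j K(., x_j), as in equation (11)."""
    return sum(a * kernel(x, p) for a, p in zip(alpha, points))


def inner(alpha, beta, points, kernel=gauss_kernel):
    """<f, g> for two expansions over the same points, extended by bilinearity."""
    return sum(
        a * b * kernel(p, q)
        for a, p in zip(alpha, points)
        for b, q in zip(beta, points)
    )


points = [-1.0, 0.0, 0.5, 2.0]
alpha = [0.7, -1.2, 0.4, 0.9]

sq_norm = inner(alpha, alpha, points)   # ||f||^2 = sum_ij alpha_i alpha_j K(x_i, x_j)
unit = [0.0, 0.0, 1.0, 0.0]             # coefficients of the function K(., points[2])

# Reproducing property: <f, K(., x_i)> should equal f(x_i).
print(sq_norm, inner(alpha, unit, points), evaluate(alpha, points, points[2]))
```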

In the framework of empirical risk minimization, the discriminant function $f \in \mathcal{H}$ is chosen by minimizing
a cost function given by the sum of the empirical $\phi$-risk $\hat{E}\phi(Y f(X))$ and a suitable regularization
term

$$\min_{f \in \mathcal{H}} \sum_{i=1}^{n} \phi(y_i f(x_i)) + \frac{\lambda}{2} \|f\|_{\mathcal{H}}^2, \tag{12}$$

where $\lambda > 0$ is a regularization parameter. The Representer Theorem (Thm. 4.2; [26]) guarantees that the
optimal solution to problem (12) can be written in the form $f(x) = \sum_{i=1}^{n} \alpha_i y_i K_x(x, x_i)$, for a particular
vector $\alpha \in \mathbb{R}^n$. The key here is that the sum ranges only over the observed data points $\{(x_i, y_i)\}_{i=1}^n$.

For the sake of development in the sequel, it will be convenient to express functions $f \in \mathcal{H}$ as linear
discriminants involving the feature map $\Phi(x) := K_x(\cdot, x)$. (Note that for each $x \in \mathcal{X}$, the quantity
$\Phi(x) = \Phi(x)(\cdot)$ is a function from $\mathcal{X}$ to the real line $\mathbb{R}$.) Any function $f$ in the Hilbert space can be written
as a linear discriminant of the form $\langle w, \Phi(x) \rangle$ for some function $w \in \mathcal{H}$. (In fact, by the reproducing
property, we have $f(\cdot) = w(\cdot)$.) As a particular case, the Representer Theorem allows us to write the
optimal discriminant as $f(x) = \langle w, \Phi(x) \rangle$, where $w = \sum_{i=1}^{n} \alpha_i y_i \Phi(x_i)$.
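Because the Representer Theorem restricts attention to expansions over the training points, the cost (12) can be evaluated directly from a coefficient vector $\alpha$. The sketch below does exactly that for the hinge loss; the kernel $K(u, v) = 1 + uv$, the data and the coefficient values are hypothetical, and no optimization is performed.

```python
def kernel(u, v):
    """Hypothetical kernel on the real line: K(u, v) = 1 + u * v."""
    return 1.0 + u * v


def hinge(m):
    return max(1.0 - m, 0.0)


def discriminant(alpha, xs, ys, x):
    """f(x) = sum_i alpha_i y_i K(x, x_i), the form given by the Representer Theorem."""
    return sum(a * y * kernel(x, xi) for a, xi, y in zip(alpha, xs, ys))


def objective(alpha, xs, ys, lam, loss=hinge):
    """Cost (12): sum_i loss(y_i f(x_i)) + (lam / 2) * ||f||^2."""
    risk = sum(loss(y * discriminant(alpha, xs, ys, x)) for x, y in zip(xs, ys))
    sq_norm = sum(
        ai * yi * aj * yj * kernel(xi, xj)
        for ai, xi, yi in zip(alpha, xs, ys)
        for aj, xj, yj in zip(alpha, xs, ys)
    )
    return risk + 0.5 * lam * sq_norm


xs = [-2.0, -1.0, 1.0, 3.0]
ys = [-1, -1, 1, 1]

# alpha = 0 gives f = 0, so every sample pays hinge(0) = 1.
print(objective([0.0] * 4, xs, ys, lam=0.1))
# With these coefficients f(x) = x, so all margins are at least 1.
print(objective([0.0, 0.5, 0.5, 0.0], xs, ys, lam=0.1))
```

The second coefficient vector drives the hinge term to zero and leaves only the regularization penalty, which is the trade-off that $\lambda$ controls.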

3.2  Fusion  center  and  marginalized  kernels 

With this background, we first consider how to design the decision rule $\gamma$ at the fusion center for a fixed setting
$Q(Z|X)$ of the sensor decision rules. Since the fusion center rule can only depend on $z = (z^1, \ldots, z^S)$,
our starting point is a feature space $\{\Phi'(z)\}$ with associated kernel $K_z$. Following the development in the
previous section, we consider fusion center rules defined by taking the sign of a linear discriminant of the
form $\gamma(z) := \langle w, \Phi'(z) \rangle$. We then link the performance of $\gamma$ to another kernel-based discriminant function
$f$ that acts directly on $x = (x^1, \ldots, x^S)$, where the new kernel $K_Q$ associated with $f$ is defined as a
marginalized kernel in terms of $Q(Z|X)$ and $K_z$.

The relevant optimization problem is to minimize (as a function of $w$) the following regularized form of
the empirical $\phi$-risk associated with the discriminant $\gamma$:

$$\min_{w} \Big\{ \sum_{z} \sum_{i=1}^{n} \phi(y_i \gamma(z))\, Q(z|x_i) + \frac{\lambda}{2} \|w\|^2 \Big\}, \tag{13}$$

where $\lambda > 0$ is a regularization parameter. In its current form, the objective function (13) is intractable to
compute (because it involves summing over all $L^S$ possible values of $z$ of a loss function that is generally
non-decomposable). However, exploiting the convexity of $\phi$ allows us to perform the computation exactly
for deterministic rules in $\mathcal{Q}_0$, and also leads to a natural relaxation for an arbitrary decision rule $Q \in \mathcal{Q}$.
This idea is formalized in the following:

Proposition 4. Define the quantities

$$\Phi_Q(x) := \sum_{z} Q(z|x)\, \Phi'(z), \quad \text{and} \quad f(x; Q) := \langle w, \Phi_Q(x) \rangle. \tag{14}$$

For any convex $\phi$, the optimal value of the following optimization problem is a lower bound on the optimal
value in problem (13):

$$\min_{w} \sum_{i} \phi(y_i f(x_i; Q)) + \frac{\lambda}{2} \|w\|^2. \tag{15}$$

Moreover, the relaxation is tight for any deterministic rule $Q(Z|X)$.


Proof. Applying Jensen’s inequality to the function $\phi$ yields $\phi(y_i f(x_i; Q)) \leq \sum_{z} \phi(y_i \gamma(z))\, Q(z|x_i)$ for
each $i = 1, \ldots, n$, from which the lower bound follows. Equality for deterministic $Q \in \mathcal{Q}_0$ is immediate. □
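The Jensen step in the proof can be checked numerically for a single sample. The sketch below compares the relaxed loss term of (15) with the exact term of (13) for the hinge loss. The one-hot feature map, the weight vector and the two sensor rules are hypothetical choices made for illustration; `Q[t][x][z]` again stores $Q^t(z|x)$ with 0-indexed values.

```python
import itertools


def hinge(m):
    return max(1.0 - m, 0.0)


def q_joint(Q, z, x):
    """Factorized rule (1): Q(z | x) = prod_t Q^t(z^t | x^t)."""
    p = 1.0
    for t, table in enumerate(Q):
        p *= table[x[t]][z[t]]
    return p


def feature(z, L):
    """Hypothetical feature map Phi'(z): one-hot code of each message, concatenated."""
    return [1.0 if l == zt else 0.0 for zt in z for l in range(L)]


def dot(a, b):
    return sum(u * v for u, v in zip(a, b))


def relaxed_and_exact(Q, w, x, y, L, loss=hinge):
    """Return (loss(y f(x; Q)), sum_z loss(y gamma(z)) Q(z | x)) for one sample."""
    S = len(Q)
    phi_q = [0.0] * (S * L)
    exact = 0.0
    for z in itertools.product(range(L), repeat=S):
        p = q_joint(Q, z, x)
        fz = feature(z, L)
        phi_q = [a + p * b for a, b in zip(phi_q, fz)]   # accumulates Phi_Q(x) of (14)
        exact += p * loss(y * dot(w, fz))                 # per-sample term of (13)
    return loss(y * dot(w, phi_q)), exact                 # first entry: term of (15)


w = [0.8, -1.5, -0.3, 1.1]   # arbitrary weights of length S * L = 4
stochastic = [[[0.7, 0.3], [0.2, 0.8]], [[0.6, 0.4], [0.1, 0.9]]]
deterministic = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]]

print(relaxed_and_exact(stochastic, w, x=(0, 1), y=1, L=2))
print(relaxed_and_exact(deterministic, w, x=(0, 1), y=1, L=2))
```

For the stochastic rule the first entry never exceeds the second; for the deterministic rule the two coincide, which is the tightness claim of the proposition.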

A key point is that the modified optimization problem (15) involves an ordinary regularized empirical
$\phi$-loss, but in terms of a linear discriminant function $f(x; Q) = \langle w, \Phi_Q(x) \rangle$ in the transformed feature
space $\{\Phi_Q(x)\}$ defined in equation (14). Moreover, the corresponding marginalized kernel function takes
the form:

$$K_Q(x, x') := \sum_{z, z'} Q(z|x)\, Q(z'|x')\, K_z(z, z'), \tag{16}$$

where $K_z(z, z') := \langle \Phi'(z), \Phi'(z') \rangle$ is the kernel in $\{\Phi'(z)\}$-space. It is straightforward to see that the
positive semidefiniteness of $K_z$ implies that $K_Q$ is also a positive semidefinite function.

From  a  computational  point  of  view,  we  have  converted  the  marginalization  over  loss  function  values 
to  a  marginalization  over  kernel  functions.  While  the  former  is  intractable,  the  latter  marginalization  can 
be carried out in many cases by exploiting the structure of the conditional distributions $Q(Z|X)$. (In Section 3.3,
we provide several examples to illustrate.) From the modeling perspective, it is interesting to
note  that  marginalized  kernels,  like  that  of  equation  (16),  underlie  recent  work  that  aims  at  combining  the 
advantages  of  graphical  models  and  Mercer  kernels  [16,  29]. 

As a standard kernel-based formulation, the optimization problem (15) can be solved by the usual Lagrangian
dual formulation [26], thereby yielding an optimal weight vector $w$. This weight vector defines the
decision rule for the fusion center by $\gamma(z) := \langle w, \Phi'(z) \rangle$. By the Representer Theorem [26], the optimal
solution $w$ to problem (15) has an expansion of the form

$$w = \sum_{i=1}^{n} \alpha_i y_i \Phi_Q(x_i) = \sum_{i=1}^{n} \sum_{z'} \alpha_i y_i Q(z'|x_i)\, \Phi'(z'),$$

where $\alpha$ is an optimal dual solution, and the second equality follows from the definition of $\Phi_Q(x)$ given in
equation (14). Substituting this decomposition of $w$ into the definition of $\gamma$ yields

$$\gamma(z) := \sum_{i=1}^{n} \sum_{z'} \alpha_i y_i Q(z'|x_i)\, K_z(z, z'). \tag{17}$$

Note that there is an intuitive connection between the discriminant functions $f$ and $\gamma$. In particular, using the
definitions of $f$ and $K_Q$, it can be seen that $f(x) = E[\gamma(Z)|x]$, where the expectation is taken with respect
to $Q(Z|X = x)$. The interpretation is quite natural: when conditioned on some $x$, the average behavior of
the discriminant function $\gamma(Z)$, which does not observe $x$, is equivalent to the optimal discriminant $f(x)$,
which does have access to $x$.
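The identity $f(x) = E[\gamma(Z)|x]$ holds for any coefficient vector, not only for an optimal dual solution, so it can be verified with made-up numbers. In the sketch below, $f$ is formed from the marginalized kernel (16) and $\gamma$ from equation (17), both by brute-force enumeration. The first-order count kernel serves as the base kernel $K_z$; the sensor rule, the data and the values in `alpha` are hypothetical.

```python
import itertools


def q_joint(Q, z, x):
    """Factorized rule (1): Q(z | x) = prod_t Q^t(z^t | x^t)."""
    p = 1.0
    for t, table in enumerate(Q):
        p *= table[x[t]][z[t]]
    return p


def count_kernel(z, zp):
    """First-order count kernel, used here as an example base kernel K_z."""
    return float(sum(a == b for a, b in zip(z, zp)))


def messages(S, L):
    return list(itertools.product(range(L), repeat=S))


def k_q(Q, x, xp, L, kz=count_kernel):
    """Marginalized kernel (16), by brute force over all message pairs."""
    Z = messages(len(Q), L)
    return sum(q_joint(Q, z, x) * q_joint(Q, zp, xp) * kz(z, zp) for z in Z for zp in Z)


def gamma(Q, alpha, xs, ys, z, L, kz=count_kernel):
    """Fusion-center discriminant (17)."""
    Z = messages(len(Q), L)
    return sum(
        a * y * q_joint(Q, zp, xi) * kz(z, zp)
        for a, xi, y in zip(alpha, xs, ys)
        for zp in Z
    )


def f(Q, alpha, xs, ys, x, L, kz=count_kernel):
    """Discriminant on observations: f(x) = sum_i alpha_i y_i K_Q(x_i, x)."""
    return sum(a * y * k_q(Q, xi, x, L, kz) for a, xi, y in zip(alpha, xs, ys))


def mean_gamma(Q, alpha, xs, ys, x, L, kz=count_kernel):
    """E[gamma(Z) | x], with Z distributed according to Q(. | x)."""
    return sum(
        q_joint(Q, z, x) * gamma(Q, alpha, xs, ys, z, L, kz)
        for z in messages(len(Q), L)
    )


# Two sensors, M = 3 observation values, L = 2 messages.
Q = [
    [[0.7, 0.3], [0.2, 0.8], [0.5, 0.5]],
    [[0.6, 0.4], [0.1, 0.9], [0.3, 0.7]],
]
xs = [(0, 1), (2, 2), (1, 0)]
ys = [1, -1, 1]
alpha = [0.4, 1.0, 0.25]   # arbitrary coefficients, not an optimized dual solution

print(f(Q, alpha, xs, ys, (2, 0), L=2), mean_gamma(Q, alpha, xs, ys, (2, 0), L=2))
```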

3.3  Design  and  computation  of  marginalized  kernels 

As seen in the previous section, the representation of discriminant functions $f$ and $\gamma$ depends on the kernel
functions $K_z(z, z')$ and $K_Q(x, x')$, and not on the explicit representation of the underlying feature spaces
$\{\Phi'(z)\}$ and $\{\Phi_Q(x)\}$. It is also shown in the next section that our algorithm for solving for $f$ and $\gamma$ requires
only the knowledge of the kernel functions $K_z$ and $K_Q$. Indeed, the effectiveness of a kernel-based algorithm
typically hinges heavily on the design and computation of its kernel function(s).

Accordingly, let us now consider the computational issues associated with the marginalized kernel $K_Q$,
assuming that $K_z$ has already been chosen. In general, the computation of $K_Q(x, x')$ entails marginalizing
over the variable $Z$, which (at first glance) has computational complexity on the order of $O(L^S)$. However,
this calculation fails to take advantage of any structure in the kernel function $K_z$. More specifically, it is
often the case that the kernel function $K_z(z, z')$ can be decomposed into local functions, in which case the
computational cost is considerably lower. Here we provide a few examples of computationally tractable
kernels.

Computationally  tractable  kernels: 

(a) Perhaps the simplest example is the linear kernel $K_z(z, z') = \sum_{t=1}^{S} z^t z'^t$, for which it is straightforward to derive $K_Q(x, x') = \sum_{t=1}^{S} \mathbb{E}[z^t|x^t] \, \mathbb{E}[z'^t|x'^t]$.

(b) A second example, natural for applications in which $X^t$ and $Z^t$ are discrete random variables, is the count kernel. Let us represent each discrete value $u \in \{1, \ldots, M\}$ as an $M$-dimensional vector $(0, \ldots, 1, \ldots, 0)$, whose $u$-th coordinate takes value 1. If we define the first-order count kernel $K_z(z, z') := \sum_{t=1}^{S} \mathbb{I}[z^t = z'^t]$, then the resulting marginalized kernel takes the form:

$$K_Q(x, x') = \sum_{z, z'} Q(z|x) Q(z'|x') \sum_{t=1}^{S} \mathbb{I}[z^t = z'^t] = \sum_{t=1}^{S} Q(z^t = z'^t \mid x^t, x'^t). \tag{18}$$

(c) A natural generalization is the second-order count kernel $K_z(z, z') = \sum_{t,r=1}^{S} \mathbb{I}[z^t = z'^t] \, \mathbb{I}[z^r = z'^r]$ that accounts for the pairwise interaction between coordinates $z^t$ and $z^r$. For this example, the associated marginalized kernel $K_Q(x, x')$ takes the form:

$$2 \sum_{1 \le t < r \le S} Q(z^t = z'^t \mid x^t, x'^t) \, Q(z^r = z'^r \mid x^r, x'^r). \tag{19}$$
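To illustrate why decomposable base kernels are cheap to marginalize, the following Python sketch compares a brute-force evaluation of the marginalized first-order count kernel (a double sum over all $L^S$ messages) with the factorized form, which needs only one inner product per sensor. The array layout `Q[t, u, l]` for $Q^t(z^t = l \mid x^t = u)$ and the function names are assumptions made for this example.

```python
import itertools
import numpy as np

def count_kernel_brute(Q, x, xp):
    """Sum over all pairs (z, z') of Q(z|x) Q(z'|x') * sum_t I[z^t = z'^t]."""
    S, M, L = Q.shape
    total = 0.0
    for z in itertools.product(range(L), repeat=S):
        qz = np.prod([Q[t, x[t], z[t]] for t in range(S)])
        for zp in itertools.product(range(L), repeat=S):
            qzp = np.prod([Q[t, xp[t], zp[t]] for t in range(S)])
            total += qz * qzp * sum(a == b for a, b in zip(z, zp))
    return total

def count_kernel_factored(Q, x, xp):
    """sum_t Q(z^t = z'^t | x^t, x'^t), with each term sum_l Q^t(l|x^t) Q^t(l|x'^t)."""
    S = Q.shape[0]
    return sum(float(Q[t, x[t]] @ Q[t, xp[t]]) for t in range(S))

rng = np.random.default_rng(1)
Q = rng.dirichlet(np.ones(3), size=(3, 4))   # S = 3 sensors, M = 4, L = 3
x, xp = [0, 2, 3], [1, 2, 0]
brute = count_kernel_brute(Q, x, xp)
fast = count_kernel_factored(Q, x, xp)
print(brute, fast)
```

The brute-force routine touches $L^{2S}$ message pairs, whereas the factorized routine costs $O(SL)$, which is what makes the kernel usable for networks with many sensors.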

Remarks: First, note that even for a linear base kernel $K_z$, the kernel function $K_Q$ inherits additional (nonlinear) structure from the marginalization over $Q(Z|X)$. As a consequence, the associated discriminant functions (i.e., $\gamma$ and $f$) are certainly not linear. Second, our formulation allows any available prior knowledge to be incorporated into $K_Q$ in at least two possible ways: (i) the base kernel representing a similarity measure in the quantized space of $z$ can reflect the structure of the sensor network, or (ii) more structured decision rules $Q(Z|X)$ can be considered, such as chain or tree-structured decision rules.

3.4  Joint  optimization 

Our next task is to perform joint optimization of both the fusion center rule, defined by $w$ (or equivalently $\alpha$, as in equation (17)), and the sensor rules $Q$. Observe that the cost function (15) can be re-expressed as a function of both $w$ and $Q$ as follows:

$$G(w; Q) := \frac{1}{\lambda} \sum_{i} \phi\Big( y_i \Big\langle w, \sum_{z} Q(z|x_i) \Phi'(z) \Big\rangle \Big) + \frac{1}{2} \|w\|^2. \tag{20}$$

Of interest is the joint minimization of the function $G$ in both $w$ and $Q$. It can be seen easily that
(a) $G$ is convex in $w$ with $Q$ fixed; and
(b) moreover, $G$ is convex in $Q^t$, when both $w$ and all other $\{Q^r, r \neq t\}$ are fixed.

These observations motivate the use of blockwise coordinate gradient descent to perform the joint minimization.

Optimization of $w$: As described in Section 3.2, when $Q$ is fixed, then $\min_w G(w; Q)$ can be computed efficiently by a dual reformulation. Specifically, as we establish in the following result using ideas from convex duality [24], a dual reformulation of $\min_w G(w; Q)$ is given by

$$\max_{\alpha \in \mathbb{R}^n} \Big\{ -\frac{1}{\lambda} \sum_{i=1}^{n} \phi^*(-\lambda \alpha_i) - \frac{1}{2} \alpha^T \big[ (y y^T) \circ K_Q \big] \alpha \Big\}, \tag{21}$$

where $\phi^*(u) := \sup_{v \in \mathbb{R}} \{ u \cdot v - \phi(v) \}$ is the conjugate dual of $\phi$, $[K_Q]_{ij} := K_Q(x_i, x_j)$ is the empirical kernel matrix, and $\circ$ denotes Hadamard product.

Proposition 5. For each fixed $Q \in \mathcal{Q}$, the value of the primal problem $\inf_w G(w; Q)$ is attained and equal to its dual form (21). Furthermore, any optimal solution $\alpha$ to problem (21) defines the optimal primal solution $w(Q)$ to $\min_w G(w; Q)$ via $w(Q) = \sum_{i=1}^{n} \alpha_i y_i \Phi_Q(x_i)$.

Proof. It suffices for our current purposes to restrict to the case where the functions $w$ and $\Phi_Q(x)$ can be viewed as vectors in some finite-dimensional space, say $\mathbb{R}^m$. However, it is possible to extend this approach to the infinite-dimensional setting by using conjugacy in general normed spaces [21].

A remark on notation before proceeding: since $Q$ is fixed, we drop $Q$ from $G$ for notational convenience (i.e., we write $G(w) = G(w; Q)$). First, we observe that $G(w)$ is convex with respect to $w$ and that $G \to \infty$ as $\|w\| \to \infty$. Consequently, the infimum defining the primal problem $\inf_{w \in \mathbb{R}^m} G(w)$ is attained. We now re-write this primal problem as follows:

$$\inf_{w \in \mathbb{R}^m} G(w) = \inf_{w \in \mathbb{R}^m} \{ G(w) - \langle w, 0 \rangle \} = -G^*(0),$$

where $G^* : \mathbb{R}^m \to \mathbb{R}$ denotes the conjugate dual of $G$.

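Using the notation $g_i(w) := \frac{1}{\lambda} \phi(\langle w, y_i \Phi_Q(x_i) \rangle)$ and $\Omega(w) := \frac{1}{2} \|w\|^2$, we can decompose $G$ as the sum $G(w) = \sum_{i=1}^{n} g_i(w) + \Omega(w)$. This decomposition allows us to compute the conjugate dual $G^*$ via the inf-convolution theorem (Thm. 16.4; Rockafellar [24]) as follows:

$$G^*(0) = \inf_{u_i, \, i = 1, \ldots, n} \Big\{ \sum_{i=1}^{n} g_i^*(u_i) + \Omega^*\Big( -\sum_{i=1}^{n} u_i \Big) \Big\}. \tag{22}$$

Applying calculus rules for conjugacy operations (Thm. 16.3; [24]), we obtain:

$$g_i^*(u_i) = \begin{cases} \frac{1}{\lambda} \phi^*(-\lambda \alpha_i) & \text{if } u_i = -\alpha_i (y_i \Phi_Q(x_i)) \text{ for some } \alpha_i \in \mathbb{R} \\ +\infty & \text{otherwise.} \end{cases} \tag{23}$$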


A straightforward calculation yields $\Omega^*(v) = \sup_w \{ \langle v, w \rangle - \frac{1}{2} \|w\|^2 \} = \frac{1}{2} \|v\|^2$. Substituting these expressions into equation (22) leads to:

$$G^*(0) = \inf_{\alpha \in \mathbb{R}^n} \Big\{ \sum_{i=1}^{n} \frac{1}{\lambda} \phi^*(-\lambda \alpha_i) + \frac{1}{2} \Big\| \sum_{i=1}^{n} \alpha_i y_i \Phi_Q(x_i) \Big\|^2 \Big\},$$

from which it follows that

$$\inf_w G(w) = -G^*(0) = \sup_{\alpha \in \mathbb{R}^n} \Big\{ -\frac{1}{\lambda} \sum_{i=1}^{n} \phi^*(-\lambda \alpha_i) - \frac{1}{2} \sum_{1 \le i, j \le n} \alpha_i \alpha_j y_i y_j K_Q(x_i, x_j) \Big\}.$$

Thus, we have derived the dual form (21). See Appendix 5 for the remainder of the proof, in which we derive the link between $w(Q)$ and the dual variables $\alpha$. □

This proposition is significant in that the dual problem involves only the kernel matrix $(K_Q(x_i, x_j))_{1 \le i, j \le n}$. Hence, one can solve for the optimal discriminant functions $y = f(x)$ or $y = \gamma(z)$ without requiring explicit knowledge of the underlying feature spaces $\{\Phi'(z)\}$ and $\{\Phi_Q(x)\}$. As a particular example, consider the case of hinge loss function (8), as used in the SVM algorithm [26]. A straightforward calculation yields

$$\phi^*(u) = \begin{cases} u & \text{if } u \in [-1, 0] \\ +\infty & \text{otherwise.} \end{cases}$$

Substituting this formula into (21) yields, as a special case, the familiar dual formulation for the SVM:

$$\max_{0 \le \alpha \le 1/\lambda} \Big\{ \sum_{i} \alpha_i - \frac{1}{2} \alpha^T \big[ (y y^T) \circ K_Q \big] \alpha \Big\}.$$
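As a rough illustration of how the box-constrained SVM dual above might be solved once a kernel matrix is available, the following Python sketch uses projected gradient ascent. This is a deliberately simple stand-in solver chosen for the example, not the solver used in the paper; the function names, the random positive semidefinite matrix `K` standing in for $K_Q$, and the value of `lam` are all assumptions.

```python
import numpy as np

def svm_dual_objective(alpha, H):
    """sum_i alpha_i - 0.5 * alpha^T H alpha, with H = (y y^T) o K."""
    return alpha.sum() - 0.5 * alpha @ H @ alpha

def solve_svm_dual(K, y, lam, n_iter=2000):
    """Projected gradient ascent on the box 0 <= alpha_i <= 1/lam."""
    H = np.outer(y, y) * K                          # Hadamard product (y y^T) o K
    step = 1.0 / max(np.linalg.norm(H, 2), 1e-12)   # 1 / largest eigenvalue of H
    alpha = np.zeros(len(y))
    for _ in range(n_iter):
        grad = 1.0 - H @ alpha                      # gradient of the concave objective
        alpha = np.clip(alpha + step * grad, 0.0, 1.0 / lam)  # project onto the box
    return alpha

rng = np.random.default_rng(4)
A = rng.normal(size=(20, 5))
K = A @ A.T                                # stand-in PSD kernel matrix
y = np.where(A[:, 0] > 0, 1.0, -1.0)       # toy labels
lam = 0.5
alpha = solve_svm_dual(K, y, lam)
f_train = K @ (alpha * y)                  # f(x_j) = sum_i alpha_i y_i K(x_j, x_i)
print(svm_dual_objective(alpha, np.outer(y, y) * K))
```

With the step size set to the reciprocal of the largest eigenvalue of $(yy^T) \circ K$, each projected step does not decrease the dual objective; a dedicated quadratic programming solver would normally be preferred in practice.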


Optimization of $Q$: The second step is to minimize $G$ over $Q^t$, with $w$ and all other $\{Q^r, r \neq t\}$ held fixed. Our approach is to compute the derivative (or more generally, the subdifferential) with respect to $Q^t$, and then apply a gradient-based method. A challenge to be confronted is that $G$ is defined in terms of feature vectors $\Phi'(z)$, which are typically high-dimensional quantities. Indeed, although it is intractable to evaluate the gradient at an arbitrary $w$, the following result establishes that it can always be evaluated at the point $(w(Q), Q)$ for any $Q \in \mathcal{Q}$.


Lemma 6. Let $w(Q)$ be the optimizing argument of $\min_w G(w; Q)$, and let $\alpha$ be an optimal solution to the dual problem (21). Then the following element

$$-\lambda \sum_{(i,j),(z,z')} \alpha_i \alpha_j \, Q(z'|x_j) \, \frac{Q(z|x_i)}{Q^t(z^t|x_i^t)} \, K_z(z, z') \, \mathbb{I}[x_i^t = \bar{x}^t] \, \mathbb{I}[z^t = \bar{z}^t]$$

is an element of the subdifferential $\partial_{Q^t(\bar{z}^t|\bar{x}^t)} G$ evaluated at $(w(Q), Q)$. [1]

Proof. See Appendix 5. □

[1] Subgradient is a generalized counterpart of gradient for non-differentiable convex functions. Briefly, a subgradient of a convex function $f : \mathbb{R}^m \to \mathbb{R}$ at $x$ is a vector $s \in \mathbb{R}^m$ satisfying $f(y) \ge f(x) + \langle s, y - x \rangle$ for all $y \in \mathbb{R}^m$. The subdifferential at a point $x$ is the set of all subgradients; hence, if $f$ is differentiable at $x$, the subdifferential consists of the single vector $\{\nabla f(x)\}$. In our cases, $G$ is non-differentiable when $\phi$ is the hinge loss (8), and differentiable when $\phi$ is the logistic loss (9) or exponential loss (10). More details on convex analysis can be found in the books [24, 14].

Observe that this representation of the (sub)gradient involves marginalization over $Q$ of the kernel function $K_z$, and therefore can be computed efficiently in many cases, as described in Section 3.3. Overall, the blockwise coordinate descent algorithm for optimizing the choice of local decision rules takes the following form:

Kernel quantization (KQ) algorithm:

(a) With $Q$ fixed, compute the optimizing $w(Q)$ by solving the dual problem (21).

(b) For some index $t$, fix $w(Q)$ and $\{Q^r, r \neq t\}$ and take a gradient step in $Q^t$ using Lemma 6.

Upon convergence, we define a deterministic decision rule for each sensor $t$ via:

$$\gamma^t(x^t) := \arg\max_{z^t \in \mathcal{Z}} Q(z^t|x^t). \tag{24}$$
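The final rounding step, which turns the learned conditional distributions into deterministic sensor rules, can be sketched as follows. This Python fragment assumes (for illustration only) that the learned rules are stored as an array `Q[t, u, l]` holding $Q^t(z^t = l \mid x^t = u)$; the function names are hypothetical.

```python
import numpy as np

def deterministic_rules(Q):
    """Return a table rules[t, u] = argmax_l Q[t, u, l] (ties go to the smallest l)."""
    return np.argmax(Q, axis=2)

def quantize(rules, x):
    """Apply each sensor's deterministic rule to its own observation x[t]."""
    return np.array([rules[t, x[t]] for t in range(len(x))])

# Two sensors, two observation levels, two message levels.
Q = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.4, 0.6], [0.7, 0.3]]])
rules = deterministic_rules(Q)
message = quantize(rules, [1, 0])
print(rules, message)
```

Each sensor only needs its own row of `rules`, so the lookup can be carried out locally without any communication between sensors.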


Remarks: A number of comments about this algorithm are in order. At a high level, the updates consist of alternately updating the decision rule for a sensor while fixing the decision rules for the remaining sensors and the fusion center, and updating the decision rule for the fusion center while fixing the decision rules for all sensors. In this sense, our approach is similar in spirit to a suite of practical algorithms [e.g., 28] for decentralized detection under particular assumptions on the joint distribution $P(X, Y)$.

Using standard results [5], it is straightforward to guarantee convergence of such coordinate-wise updates when the loss function $\phi$ is strictly convex and differentiable (e.g., logistic loss (9) or exponential loss (10)). In contrast, the case of non-differentiable $\phi$ (e.g., hinge loss (8)) requires more care. We have, however, obtained good results in practice even in the case of hinge loss.

Finally, it is interesting to note the connection between the KQ algorithm and the naive approach considered in Section 2.2. More precisely, suppose that we fix $w$ such that all $\alpha_i$ are equal to one, and let the base kernel $K_z$ be constant (and thus entirely uninformative). Under these conditions, the optimization of $G$ with respect to $Q$ reduces to exactly the naive approach.

3.5  Estimation  error  bounds 

This section is devoted to analysis of the statistical properties of the KQ algorithm. In particular, our goal is to derive bounds on the performance of our classifier $(Q, \gamma)$ when applied to new data, as opposed to the i.i.d. samples on which it was trained. It is key to distinguish between two forms of $\phi$-risk:

(a) the empirical $\phi$-risk $\hat{\mathbb{E}}\phi(Y\gamma(Z))$ is defined by an expectation over $P_n(X, Y) Q(Z|X)$, where $P_n$ is the empirical distribution given by the i.i.d. samples $\{(x_i, y_i)\}_{i=1}^{n}$.

(b) the true $\phi$-risk $\mathbb{E}\phi(Y\gamma(Z))$ is defined by taking an expectation over the joint distribution $P(X, Y) Q(Z|X)$.

In designing our classifier, we made use of the empirical $\phi$-risk as a proxy for the actual risk. On the other hand, the appropriate metric for assessing performance of the designed classifier is the true $\phi$-risk $\mathbb{E}\phi(Y\gamma(Z))$. At a high level, our procedure for obtaining performance bounds can be decomposed into the following steps:

1. First, we relate the true $\phi$-risk $\mathbb{E}\phi(Y\gamma(Z))$ to the true $\phi$-risk $\mathbb{E}\phi(Yf(X))$ for the functions $f \in \mathcal{F}$ (and $f \in \mathcal{F}_0$) that are computed at intermediate stages of our algorithm. The latter quantities are well-studied objects in statistical learning theory.

2. The second step is to relate the empirical $\phi$-risk $\hat{\mathbb{E}}\phi(Yf(X))$ to the true $\phi$-risk $\mathbb{E}\phi(Yf(X))$. In general, the true $\phi$-risk for a function $f$ in some class $\mathcal{F}$ is bounded by the empirical $\phi$-risk plus a complexity term that captures the "richness" of the function class $\mathcal{F}$ [35, 3]. In particular, we make use of the Rademacher complexity as a measure of this richness.

3. Third, we combine the first two steps so as to derive bounds on the true $\phi$-risk $\mathbb{E}\phi(Y\gamma(Z))$ in terms of the empirical $\phi$-risk of $f$ and the Rademacher complexity.

4. Finally, we derive bounds on the Rademacher complexity in terms of the number of training samples $n$, as well as the number of quantization levels $L$ and $M$.

Step 1: We begin by isolating the class of functions over which we optimize. Define, for a fixed $Q \in \mathcal{Q}$, the function space $\mathcal{F}_Q$ as

$$\Big\{ f : x \mapsto \langle w, \Phi_Q(x) \rangle = \sum_{i} \alpha_i y_i K_Q(x, x_i) \;\Big|\; \text{s.t. } \|w\| \le B \Big\}, \tag{25}$$

where $B > 0$ is a constant. Note that $\mathcal{F}_Q$ is simply the class of functions associated with the marginalized kernel $K_Q$. The function class over which our algorithm performs the optimization is defined by the union $\mathcal{F} := \cup_{Q \in \mathcal{Q}} \mathcal{F}_Q$, where $\mathcal{Q}$ is the space of all factorized conditional distributions $Q(Z|X)$. Lastly, we define the function class $\mathcal{F}_0 := \cup_{Q \in \mathcal{Q}_0} \mathcal{F}_Q$, corresponding to the union of the function spaces defined by marginalized kernels with deterministic distributions $Q$.

Any discriminant function $f \in \mathcal{F}$ (or $\mathcal{F}_0$), defined by a vector $\alpha$, induces an associated discriminant function $\gamma_f$ via equation (17). Relevant to the performance of the classifier $\gamma_f$ is the expected $\phi$-loss $\mathbb{E}\phi(Y\gamma_f(Z))$, whereas the algorithm actually minimizes (the empirical version of) $\mathbb{E}\phi(Yf(X))$. The relationship between these two quantities is expressed in the following proposition.

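Proposition 7.

(a) We have $\mathbb{E}\phi(Y\gamma_f(Z)) \ge \mathbb{E}\phi(Yf(X))$, with equality when $Q(Z|X)$ is deterministic.

(b) Moreover, there holds

$$\inf_{f \in \mathcal{F}} \mathbb{E}\phi(Y\gamma_f(Z)) \le \inf_{f \in \mathcal{F}_0} \mathbb{E}\phi(Yf(X)) \tag{26a}$$

$$\inf_{f \in \mathcal{F}} \mathbb{E}\phi(Y\gamma_f(Z)) \ge \inf_{f \in \mathcal{F}} \mathbb{E}\phi(Yf(X)). \tag{26b}$$

The same statement also holds for empirical expectations.

Proof. Applying Jensen's inequality to the convex function $\phi$ yields

$$\mathbb{E}\phi(Y\gamma_f(Z)) = \mathbb{E}_{XY} \mathbb{E}[\phi(Y\gamma_f(Z)) \mid X, Y] \ge \mathbb{E}_{XY} \phi(\mathbb{E}[Y\gamma_f(Z) \mid X, Y]) = \mathbb{E}\phi(Yf(X)),$$

where we have used the conditional independence of $Z$ and $Y$ given $X$. This establishes part (a), and the lower bound (26b) follows directly. Moreover, part (a) also implies that $\inf_{f \in \mathcal{F}_0} \mathbb{E}\phi(Y\gamma_f(Z)) = \inf_{f \in \mathcal{F}_0} \mathbb{E}\phi(Yf(X))$, and the upper bound (26a) follows since $\mathcal{F}_0 \subset \mathcal{F}$. □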
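The Jensen step in part (a) can be illustrated numerically for empirical expectations. The Python sketch below compares the empirical hinge risk of $\gamma(Z)$ with that of $f(X) = \mathbb{E}[\gamma(Z)|X]$ for an arbitrary message-level discriminant, first under a randomized rule and then under its deterministic rounding. The array layout `Q[t, u, l]`, the dictionary `gamma_table`, and the toy data are assumptions made for the example.

```python
import itertools
import numpy as np

def hinge(m):
    return np.maximum(0.0, 1.0 - m)

def risks(Q, X, y, gamma_table):
    """Return (empirical risk of gamma(Z), empirical risk of f(X) = E[gamma(Z)|X])."""
    S, M, L = Q.shape
    all_z = list(itertools.product(range(L), repeat=S))
    g = np.array([gamma_table[z] for z in all_z])
    risk_gamma, risk_f = 0.0, 0.0
    for x_i, y_i in zip(X, y):
        # probs[k] = Q(z_k | x_i) for the k-th message in all_z
        probs = np.array([np.prod([Q[t, x_i[t], z[t]] for t in range(S)])
                          for z in all_z])
        risk_gamma += probs @ hinge(y_i * g)      # E[phi(y gamma(Z)) | x_i]
        risk_f += hinge(y_i * (probs @ g))        # phi(y E[gamma(Z) | x_i])
    return risk_gamma / len(y), risk_f / len(y)

rng = np.random.default_rng(2)
S, M, L, n = 3, 4, 2, 6
Q = rng.dirichlet(np.ones(L), size=(S, M))
X = rng.integers(0, M, size=(n, S))
y = rng.choice([-1, 1], size=n)
gamma_table = {z: rng.normal() for z in itertools.product(range(L), repeat=S)}

rg, rf = risks(Q, X, y, gamma_table)              # randomized rule
Q_det = np.eye(L)[np.argmax(Q, axis=2)]           # one-hot (deterministic) rule
rg_det, rf_det = risks(Q_det, X, y, gamma_table)
print(rg, rf, rg_det, rf_det)
```

For the randomized rule the first risk is at least as large as the second, and for the deterministic rule the two coincide, mirroring the equality case of the proposition.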

Step 2: The next step is to relate the empirical $\phi$-risk for $f$ (i.e., $\hat{\mathbb{E}}\phi(Yf(X))$) to the true $\phi$-risk (i.e., $\mathbb{E}\phi(Yf(X))$). Recall that the Rademacher complexity of the function class $\mathcal{F}$ is defined [30] as

$$R_n(\mathcal{F}) = \mathbb{E} \sup_{f \in \mathcal{F}} \frac{2}{n} \sum_{i=1}^{n} \sigma_i f(X_i),$$

where the Rademacher variables $\sigma_1, \ldots, \sigma_n$ are independent and uniform on $\{-1, +1\}$, and $X_1, \ldots, X_n$ are i.i.d. samples selected according to distribution $P$. In the case that $\phi$ is Lipschitz with constant $\ell$, the empirical and true risk can be related via the Rademacher complexity as follows [20]. With probability at least $1 - \delta$ with respect to training samples $(X_i, Y_i)_{i=1}^{n}$, drawn according to the empirical distribution $P^n$, there holds

$$\sup_{f \in \mathcal{F}} \big| \hat{\mathbb{E}}\phi(Yf(X)) - \mathbb{E}\phi(Yf(X)) \big| \le 2 \ell R_n(\mathcal{F}) + \sqrt{\frac{\ln(2/\delta)}{2n}}. \tag{27}$$

Moreover, the same bound applies to $\mathcal{F}_0$.
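To give a feel for the complexity term, the following Python sketch computes the empirical Rademacher complexity of a single kernel class with a fixed kernel matrix `K` and norm bound `B`, that is, one class $\mathcal{F}_Q$ rather than the union $\mathcal{F}$. It relies on the standard fact that, for $\|w\| \le B$, the supremum of $\frac{2}{n} \sum_i \sigma_i \langle w, \Phi(x_i) \rangle$ equals $\frac{2B}{n} \sqrt{\sigma^T K \sigma}$. The function names and the random stand-in kernel matrix are assumptions; the exhaustive enumeration of sign vectors is only feasible for very small $n$.

```python
import itertools
import numpy as np

def empirical_rademacher_kernel_ball(K, B):
    """Exact average over all 2^n sign vectors of (2B/n) * sqrt(sigma^T K sigma)."""
    n = K.shape[0]
    total = 0.0
    for signs in itertools.product([-1.0, 1.0], repeat=n):
        s = np.array(signs)
        total += np.sqrt(max(s @ K @ s, 0.0))   # guard against tiny negative rounding
    return (2.0 * B / n) * total / 2 ** n

def trace_bound(K, B):
    """Jensen upper bound (2B/n) * sqrt(trace(K))."""
    return (2.0 * B / K.shape[0]) * np.sqrt(np.trace(K))

rng = np.random.default_rng(3)
A = rng.normal(size=(8, 3))
K = A @ A.T                      # stand-in PSD kernel matrix on n = 8 points
est = empirical_rademacher_kernel_ball(K, B=1.0)
bound = trace_bound(K, B=1.0)
print(est, bound)
```

For larger samples the expectation over sign vectors would be approximated by Monte Carlo sampling instead of enumeration.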

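Step 3: Combining the bound (27) with Proposition 7 leads to the following theorem, which provides generalization error bounds for the optimal $\phi$-risk of the decision function learned by our algorithm in terms of the Rademacher complexities $R_n(\mathcal{F}_0)$ and $R_n(\mathcal{F})$:

Theorem 8. Given $n$ i.i.d. labeled data points $(x_i, y_i)_{i=1}^{n}$, with probability at least $1 - 2\delta$,

$$\inf_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} \phi(y_i f(x_i)) - 2 \ell R_n(\mathcal{F}) - \sqrt{\frac{\ln(2/\delta)}{2n}} \;\le\; \inf_{f \in \mathcal{F}} \mathbb{E}\phi(Y\gamma_f(Z)) \;\le\; \inf_{f \in \mathcal{F}_0} \frac{1}{n} \sum_{i=1}^{n} \phi(y_i f(x_i)) + 2 \ell R_n(\mathcal{F}_0) + \sqrt{\frac{\ln(2/\delta)}{2n}}.$$

Proof. Using bound (27), with probability at least $1 - \delta$, for any $f \in \mathcal{F}$,

$$\mathbb{E}\phi(Yf(X)) \ge \frac{1}{n} \sum_{i=1}^{n} \phi(y_i f(x_i)) - 2 \ell R_n(\mathcal{F}) - \sqrt{\frac{\ln(2/\delta)}{2n}}.$$

Combining with (26b), we have, with probability $1 - \delta$,

$$\inf_{f \in \mathcal{F}} \mathbb{E}\phi(Y\gamma_f(Z)) \ge \inf_{f \in \mathcal{F}} \mathbb{E}\phi(Yf(X)) \ge \inf_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} \phi(y_i f(x_i)) - 2 \ell R_n(\mathcal{F}) - \sqrt{\frac{\ln(2/\delta)}{2n}},$$

which proves the lower bound of the theorem with probability at least $1 - \delta$. The upper bound is similarly true with probability at least $1 - \delta$. Hence, both are true with probability at least $1 - 2\delta$, by the union bound. □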


Step 4: So that Theorem 8 has practical meaning, we need to derive upper bounds on the Rademacher complexity of the function classes $\mathcal{F}$ and $\mathcal{F}_0$. Of particular interest is the growth in the complexity of $\mathcal{F}$ and $\mathcal{F}_0$ with respect to the number of training samples $n$, as well as the number of discrete signals $L$ and $M$. The following proposition derives such bounds, exploiting the fact that the number of 0-1 conditional probability distributions $Q(Z|X)$ is a finite number, $L^{MS}$.

Proposition 9.

$$R_n(\mathcal{F}_0) \le \frac{2B}{n} \Big[ \mathbb{E} \sup_{Q \in \mathcal{Q}_0} \sum_{i=1}^{n} K_Q(X_i, X_i) + 2(n-1) \sqrt{n/2} \, \sup_{z, z'} K_z(z, z') \sqrt{2 M S \log L} \Big]^{1/2}. \tag{28}$$

Proof. See Appendix 5. □

Although the rate given in equation (28) is not tight in terms of the number of data samples $n$, the bound is nontrivial and is relatively simple. (In particular, it depends directly on the kernel function $K$, the number of samples $n$, quantization levels $L$, number of sensors $S$, and size of observation space $M$.)
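The count $L^{MS}$ of deterministic rules can be confirmed by direct enumeration on a tiny configuration. In the Python sketch below, a deterministic rule for one sensor is represented (as an assumption of this example) by a tuple of length $M$ giving the message assigned to each observation value, and a rule for the whole network by a tuple of $S$ such per-sensor tuples.

```python
import itertools

def enumerate_deterministic_rules(S, M, L):
    """All ways to assign one of L messages to each of M observations at each of S sensors."""
    per_sensor = list(itertools.product(range(L), repeat=M))   # maps {0..M-1} -> {0..L-1}
    return list(itertools.product(per_sensor, repeat=S))

rules = enumerate_deterministic_rules(S=2, M=3, L=2)
print(len(rules), 2 ** (3 * 2))
```

Enumeration is only practical for toy sizes: a setting such as $S = 10$, $M = 8$, $L = 2$ already gives `2 ** 80` rules, which is why the bound depends on the logarithm of this count rather than on the count itself.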

We can also provide a more general and possibly tighter upper bound on the Rademacher complexity based on the concept of entropy number [30]. Indeed, an important property of the Rademacher complexity is that it can be estimated reliably from a single sample $(x_1, \ldots, x_n)$. Specifically, if we define $\hat{R}_n(\mathcal{F}) := \mathbb{E}\big[ \frac{2}{n} \sup_{f \in \mathcal{F}} \sum_{i=1}^{n} \sigma_i f(x_i) \big]$ (where the expectation is w.r.t. the Rademacher variables $\{\sigma_i\}$ only), then it can be shown using McDiarmid's inequality that $\hat{R}_n(\mathcal{F})$ is tightly concentrated around $R_n(\mathcal{F})$ with high probability [4]. Concretely, for any $\eta > 0$, there holds:

$$P\big\{ |R_n(\mathcal{F}) - \hat{R}_n(\mathcal{F})| \ge \eta \big\} \le 2 e^{-\eta^2 n / 8}. \tag{29}$$
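As a small worked example of this concentration bound, the Python sketch below inverts the tail probability $2 e^{-\eta^2 n / 8}$ to obtain the number of samples that makes it at most a target level $\delta$, namely $n \ge 8 \ln(2/\delta) / \eta^2$. The function names and the particular values of `eta` and `delta` are illustrative assumptions.

```python
import math

def deviation_probability_bound(eta, n):
    """Right-hand side of the concentration bound: 2 * exp(-eta^2 * n / 8)."""
    return 2.0 * math.exp(-eta ** 2 * n / 8.0)

def samples_needed(eta, delta):
    """Smallest integer n with 2 * exp(-eta^2 * n / 8) <= delta."""
    return math.ceil(8.0 * math.log(2.0 / delta) / eta ** 2)

eta, delta = 0.1, 0.05
n_min = samples_needed(eta, delta)
print(n_min, deviation_probability_bound(eta, n_min))
```

The required sample size grows like $1/\eta^2$ but only logarithmically in $1/\delta$.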

Hence, the Rademacher complexity is closely related to its empirical version $\hat{R}_n(\mathcal{F})$, which can be related to the concept of entropy number. In general, define the covering number $N(\epsilon, S, \rho)$ for a set $S$ to be the minimum number of balls of diameter $\epsilon$ that completely cover $S$ (according to a metric $\rho$). The $\epsilon$-entropy number of $S$ is then defined as $\log N(\epsilon, S, \rho)$. In our context, consider in particular the $L_2(P_n)$ metric defined on an empirical sample $(x_1, \ldots, x_n)$ as:

$$\|f_1 - f_2\|_{L_2(P_n)} := \Big[ \frac{1}{n} \sum_{i=1}^{n} (f_1(x_i) - f_2(x_i))^2 \Big]^{1/2}.$$

Then, it is well known [30] that for some absolute constant $C$, there holds:

$$\hat{R}_n(\mathcal{F}) \le C \int_0^\infty \Big[ \frac{\log N(\epsilon, \mathcal{F}, L_2(P_n))}{n} \Big]^{1/2} d\epsilon. \tag{30}$$

The following result relates the entropy number for $\mathcal{F}$ to the supremum of the entropy number taken over a restricted function class $\mathcal{F}_Q$.

Proposition 10. The entropy number $\log N(\epsilon, \mathcal{F}, L_2(P_n))$ of $\mathcal{F}$ is bounded above by

$$\sup_{Q \in \mathcal{Q}} \log N(\epsilon/2, \mathcal{F}_Q, L_2(P_n)) + (L - 1) M S \log \frac{2 L^S \sup \|\alpha\|_1 \sup_{z, z'} K_z(z, z')}{\epsilon}. \tag{31}$$

Moreover, the same bound holds for $\mathcal{F}_0$.

Proof. See Appendix 5. □

This proposition guarantees that the increase in the entropy number in moving from some $\mathcal{F}_Q$ to the larger class $\mathcal{F}$ is only $O((L - 1) M S \log(L^S / \epsilon))$. Consequently, we incur at most an $O([M S^2 (L - 1) \log L / n]^{1/2})$ increase in the upper bound (30) for $\hat{R}_n(\mathcal{F})$ (as well as $\hat{R}_n(\mathcal{F}_0)$). Moreover, the Rademacher complexity increases with the square root of $L \log L$, where $L$ is the number of quantization levels.

4  Experimental  Results


We  evaluated  our  algorithm  using  both  data  from  simulated  sensor  networks  and  real-world  data  sets.  We 
consider  three  types  of  sensor  network  configurations: 

Conditionally independent observations: In this example, the observations $X^1, \ldots, X^S$ are independent conditional on $Y$, as illustrated in Figure 1. We consider networks with 10 sensors ($S = 10$), each of which receives signals with 8 levels ($M = 8$). We applied the algorithm to compute decision rules for $L = 2$. In all cases, we generate $n = 200$ training samples, and the same number for testing. We performed 20 trials on each of 20 randomly generated models $P(X, Y)$.

Chain-structured dependency: A conditional independence assumption for the observations, though widely employed in most work on decentralized detection, may be unrealistic in many settings. For instance, consider the problem of detecting a random signal in noise [31], in which $Y = 1$ represents the hypothesis that a certain random signal is present in the environment, whereas $Y = -1$ represents the hypothesis that only i.i.d. noise is present. Under these assumptions $X^1, \ldots, X^S$ will be conditionally independent given $Y = -1$, since all sensors receive i.i.d. noise. However, conditioned on $Y = +1$ (i.e., in the presence of the random signal), the observations at spatially adjacent sensors will be dependent, with the dependence decaying with distance.

In a 1-D setting, these conditions can be modeled with a chain-structured dependency, and the use of a count kernel to account for the interaction among sensors. More precisely, we consider a set-up in which five sensors are located in a line such that only adjacent sensors interact with each other. More specifically, the sensors $X^{t-1}$ and $X^{t+1}$ are independent given $X^t$ and $Y$, as illustrated in Figure 2. We implemented the kernel-based quantization algorithm using either first- or second-order count kernels, and the hinge loss function (8), as in the SVM algorithm. The second-order kernel is specified in equation (19) but with the sum taken over only $t, r$ such that $|t - r| = 1$.
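A chain-restricted second-order kernel of this kind might be marginalized as in the following Python sketch, which checks the factorized form against a brute-force sum over all message pairs on a short chain. The sketch assumes a factorized rule stored as `Q[t, u, l]` for $Q^t(z^t = l \mid x^t = u)$ and interprets the restricted sum as running over ordered pairs $(t, r)$ with $|t - r| = 1$, which accounts for the leading factor of 2; both are assumptions of this example, as are the function names.

```python
import itertools
import numpy as np

def match_probs(Q, x, xp):
    """p[t] = Q(z^t = z'^t | x^t, x'^t) = sum_l Q^t(l | x^t) Q^t(l | x'^t)."""
    return np.array([Q[t, x[t]] @ Q[t, xp[t]] for t in range(Q.shape[0])])

def chain_kernel_factored(Q, x, xp):
    """2 * sum over adjacent sensors of p[t] * p[t+1]."""
    p = match_probs(Q, x, xp)
    return 2.0 * float(np.sum(p[:-1] * p[1:]))

def chain_kernel_brute(Q, x, xp):
    """Brute-force marginalization of sum_{|t-r|=1} I[z^t=z'^t] I[z^r=z'^r]."""
    S, M, L = Q.shape
    pairs = [(t, r) for t in range(S) for r in range(S) if abs(t - r) == 1]
    total = 0.0
    for z in itertools.product(range(L), repeat=S):
        qz = np.prod([Q[t, x[t], z[t]] for t in range(S)])
        for zp in itertools.product(range(L), repeat=S):
            qzp = np.prod([Q[t, xp[t], zp[t]] for t in range(S)])
            k = sum((z[t] == zp[t]) and (z[r] == zp[r]) for t, r in pairs)
            total += qz * qzp * k
    return total

rng = np.random.default_rng(5)
Q = rng.dirichlet(np.ones(2), size=(4, 3))    # S = 4 sensors, M = 3, L = 2
x, xp = [0, 1, 2, 1], [2, 1, 0, 0]
chain_fast = chain_kernel_factored(Q, x, xp)
chain_slow = chain_kernel_brute(Q, x, xp)
print(chain_fast, chain_slow)
```

The factorization works because, under a factorized $Q$, the match events at two distinct sensors are independent given the observations.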


Figure 2. Examples of graphical models $P(X, Y)$ of our simulated sensor networks. (a) Chain-structured dependency. (b) Fully connected (not all connections shown).

Spatially-dependent sensors: As a third example, we consider a 2-D layout in which, conditional on the random target being present ($Y = +1$), all sensors interact but with the strength of interaction decaying with distance. Thus $P(X|Y = 1)$ is of the form:

$$P(X|Y = 1) \propto \exp\Big\{ \sum_{t; u} h_{t;u} \mathbb{I}_u(X^t) + \sum_{t \neq r; \, uv} \theta_{tr;uv} \mathbb{I}_u(X^t) \mathbb{I}_v(X^r) \Big\}.$$

Here the parameter $h$ represents observations at individual sensors, whereas $\theta$ controls the dependence among sensors. The distribution $P(X|Y = -1)$ can be modeled in the same way with observations $h'$, and setting $\theta' = 0$ so that the sensors are conditionally independent. In simulations, we generate $\theta_{tr;uv} \sim N(1/d_{tr}, 0.1)$, where $d_{tr}$ is the distance between sensor $t$ and $r$, and the observations $h$ and $h'$ are randomly chosen in $[0, 1]^S$. We consider a sensor network with 9 nodes (i.e., $S = 9$), arrayed in the $3 \times 3$ lattice illustrated in Figure 2(b). Since computation of this density is intractable for moderate-sized networks, we generated an empirical data set $(x_i, y_i)$ by Gibbs sampling.
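A Gibbs sampler for a pairwise model of this form could look like the following Python sketch. It is a minimal illustration under several assumptions that are not taken from the text: `h` is stored as an $S \times M$ table of per-value potentials, `theta[t, r, u, v]` holds the pairwise potentials, the second argument of $N(\cdot, 0.1)$ is read as a variance, and the burn-in and sample counts are arbitrary.

```python
import numpy as np

def lattice_distances(side):
    """Euclidean distances between the nodes of a side x side lattice."""
    coords = np.array([(i, j) for i in range(side) for j in range(side)], dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

def gibbs_sample(h, theta, n_samples, n_burn, rng):
    """Draw samples from P(x) ~ exp{sum_t h[t, x_t] + sum_{t != r} theta[t, r, x_t, x_r]}."""
    S, M = h.shape
    x = rng.integers(0, M, size=S)
    samples = []
    for sweep in range(n_burn + n_samples):
        for t in range(S):
            logits = h[t].copy()
            for r in range(S):
                if r != t:
                    # both ordered pairs (t, r) and (r, t) involve x_t
                    logits += theta[t, r, :, x[r]] + theta[r, t, x[r], :]
            p = np.exp(logits - logits.max())      # stabilized softmax
            x[t] = rng.choice(M, p=p / p.sum())
        if sweep >= n_burn:
            samples.append(x.copy())
    return np.array(samples)

rng = np.random.default_rng(0)
side, M = 3, 8
S = side * side
d = lattice_distances(side)
np.fill_diagonal(d, 1.0)   # placeholder only; terms with t == r are never used
theta = rng.normal(loc=(1.0 / d)[:, :, None, None], scale=np.sqrt(0.1), size=(S, S, M, M))
h = rng.random((S, M))
X_pos = gibbs_sample(h, theta, n_samples=50, n_burn=20, rng=rng)
print(X_pos.shape)
```

Samples for the class $Y = -1$ could be drawn with the same routine by passing an all-zero `theta`, in which case each coordinate is sampled independently.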


Figure 3. Scatter plots of the test error of the LR versus KQ methods. Panel titles: (a) Naive Bayes sensor network; (b) Chain-structured sensor network; (c) Chain-structured sensor network; (d) Fully connected sensor network. (a) Conditionally independent network. (b) Chain model with first-order kernel. (c) Chain model with second-order kernel. (d) Fully connected model.

We compare the results of our algorithm to an alternative decentralized classifier based on performing a likelihood-ratio (LR) test at each sensor. Specifically, for each sensor $t$, the estimates $\hat{P}(X^t = u | Y = 1) / \hat{P}(X^t = u | Y = -1)$ for $u = 1, \ldots, M$ of the likelihood ratio are sorted and grouped evenly into $L$ bins. Given the quantized input signal and label $Y$, we then construct a naive Bayes classifier at the fusion center. This choice of decision rule provides a reasonable comparison, since thresholded likelihood ratio tests are optimal in many cases [28].
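The per-sensor step of this LR baseline (estimate the likelihood ratio for each observation value, sort, and split evenly into $L$ bins) can be sketched in Python as follows. The additive smoothing of the counts, the function name, and the toy data are assumptions of this example; the text does not say how empty counts were handled.

```python
import numpy as np

def lr_quantizer(X, y, M, L, smoothing=1.0):
    """Return rules[t, u] = bin index in {0..L-1} assigned to observation u at sensor t."""
    n, S = X.shape
    rules = np.zeros((S, M), dtype=int)
    for t in range(S):
        pos = np.bincount(X[y == 1, t], minlength=M) + smoothing    # counts under Y = +1
        neg = np.bincount(X[y == -1, t], minlength=M) + smoothing   # counts under Y = -1
        ratio = (pos / pos.sum()) / (neg / neg.sum())                # estimated likelihood ratio
        order = np.argsort(ratio, kind="stable")                     # values sorted by ratio
        rules[t, order] = np.arange(M) * L // M                      # equal-sized bins by rank
    return rules

# One sensor with M = 4 levels: low values are typical of Y = -1, high values of Y = +1.
X = np.array([[0], [0], [0], [1], [1], [2], [1], [2], [2], [3], [3], [3]])
y = np.array([-1] * 6 + [1] * 6)
lr_rules = lr_quantizer(X, y, M=4, L=2)
print(lr_rules)
```

The quantized signals produced by such a table would then be fed to a naive Bayes classifier at the fusion center, which is not shown here.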

The KQ algorithm generally yields more accurate classification performance than the likelihood-ratio based algorithm (LR). Figure 3 provides scatter plots of the test error of the KQ versus LR methods for four different set-ups, using $L = 2$ levels of quantization. Panel (a) shows the naive Bayes setting and the KQ method using the first-order count kernel. Note that the KQ test error is below the LR test error on the large majority of examples. Panels (b) and (c) show the case of chain-structured dependency, as illustrated in Figure 2(a), using a first- and second-order count kernel respectively. Again, the performance of KQ in both cases is superior to that of LR in most cases. Finally, panel (d) shows the fully-connected case of Figure 2(b) with a first-order kernel. The performance of KQ is somewhat better than LR, although by a lesser amount than the other cases.

UCI  repository  data  sets: 

We also applied our algorithm to several data sets from the machine learning data repository at the University of California Irvine [6]. In contrast to the sensor network detection problem, in which communication constraints must be respected, the problem here can be viewed as that of finding a good quantization scheme that retains information about the class label. Thus, the problem is similar in spirit to work on discretization schemes for classification [10]. The difference is that we assume that the data have already been crudely quantized (we use m = 8 levels in our experiments), and that we retain no topological information concerning the relative magnitudes of these values that could be used to drive classical discretization algorithms. Overall, the problem can be viewed as hierarchical decision-making, in which a second-level classification decision follows a first-level set of decisions concerning the features.


Data     L = 2    L = 4    L = 6    NB       CK
Pima     0.212    0.217    0.212    0.223    0.212
Iono     0.091    0.034    0.079    0.056    0.125
Bupa     0.368    0.322    0.345    0.322    0.345
Ecoli    0.082    0.176    0.176    0.235    0.188
Yeast    0.312    0.312    0.312    0.303    0.317
Wdbc     0.083    0.097    0.111    0.083    0.083

Table 1: Experimental results for the UCI data sets.

We used 75% of the data set for training and the remainder for testing. The results for our algorithm with $L = 2$, 4, and 6 quantization levels are shown in Table 1. Note that in several cases the quantized algorithm actually outperforms a naive Bayes algorithm (NB) with access to the real-valued features. This result may be due in part to the fact that our quantizer is based on a discriminative classifier, but it is worth noting that similar improvements over naive Bayes have been reported in earlier empirical work using classical discretization algorithms [10].


5  Conclusions 

We  have  presented  a  new  approach  to  the  problem  of  decentralized  decision-making  under  constraints  on 
the  number  of  bits  that  can  be  transmitted  by  each  of  a  distributed  set  of  sensors.  In  contrast  to  most 
previous  work  in  an  extensive  line  of  research  on  this  problem,  we  assume  that  the  joint  distribution  of 
sensor  observations  is  unknown,  and  that  a  set  of  data  samples  is  available.  We  have  proposed  a  novel 
algorithm  based  on  kernel  methods,  and  shown  that  it  is  quite  effective  on  both  simulated  and  real-world 
data  sets. 

The line of work described here can be extended in a number of directions. First, although we have focused on discrete observations $X$, it is natural to consider continuous signal observations. Doing so would require considering parameterized distributions $Q(Z|X)$. Second, our kernel design so far makes use of only rudimentary information from the sensor observation model, and could be improved by exploiting such knowledge more thoroughly. Third, we have considered only the so-called parallel configuration of the sensors, which amounts to the conditional independence of $Q(Z|X)$. One direction to explore is the use of kernel-based methods for richer configurations, such as tree-structured and tandem configurations [28]. Finally, the work described here falls within the area of fixed sample size detectors. An alternative type of decentralized detection procedure is a sequential detector, in which there is usually a large (possibly infinite) number of observations that can be taken in sequence (e.g., [32]). It is also interesting to consider extensions of our method to this sequential setting.


Proof of Lemma 1: (a) Since $x_1, \ldots, x_n$ are independent realizations of the random vector $X$, the quantities
$Q(z \mid x_1), \ldots, Q(z \mid x_n)$ are independent realizations of the random variable $Q(z \mid X)$. (This statement holds
for each fixed $z \in \mathcal{Z}^S$.) By the strong law of large numbers, there holds

$$\frac{1}{n} \sum_{i=1}^n Q(z \mid x_i) \xrightarrow{\text{a.s.}} \mathbb{E}\, Q(z \mid X) = P(z)$$

as $n \to +\infty$. Similarly, we have $\frac{1}{n} \sum_{i=1}^n Q(z \mid x_i)\, \mathbb{I}(y_i = 1) \xrightarrow{\text{a.s.}} \mathbb{E}\, Q(z \mid X)\, \mathbb{I}(Y = 1)$. Therefore, as $n \to \infty$,

$$\kappa(z) \xrightarrow{\text{a.s.}} \frac{\mathbb{E}\, Q(z \mid X)\, \mathbb{I}(Y = 1)}{P(z)} = \frac{\sum_x Q(z \mid X = x)\, P(X = x, Y = 1)}{P(z)} = P(Y = 1 \mid z),$$

where we have exploited the fact that $Z$ is independent of $Y$ given $X$.
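(b) For each $z \in \mathcal{Z}^S$, we have

$$\operatorname{sign}\left( \frac{\sum_{i=1}^n Q(z \mid x_i)\, \mathbb{I}(y_i = 1)}{\sum_{i=1}^n Q(z \mid x_i)} - \frac{\sum_{i=1}^n Q(z \mid x_i)\, \mathbb{I}(y_i = -1)}{\sum_{i=1}^n Q(z \mid x_i)} \right) = \operatorname{sign}\left( \frac{\sum_{i=1}^n Q(z \mid x_i)\, y_i}{\sum_{i=1}^n Q(z \mid x_i)} \right) = \gamma_{\mathrm{emp}}(z).$$

Thus, part (a) implies $\gamma_{\mathrm{emp}}(z) \to \gamma_{\mathrm{opt}}(z)$ for each $z$. Similarly, $R_{\mathrm{emp}} \to R_{\mathrm{opt}}$.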
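
The convergence in part (a) can be checked numerically. The following is a minimal sketch, not part of the original analysis: it assumes a toy model with a single sensor, an observation $X \in \{0, 1, 2\}$ and a binary message $z$, and the tables `P_XY` and `Q` are hypothetical values chosen only for illustration.

```python
import random

# Hypothetical toy model (assumption for illustration): X in {0,1,2}, Y in {-1,+1}.
P_XY = {(0, 1): 0.10, (0, -1): 0.30,
        (1, 1): 0.20, (1, -1): 0.10,
        (2, 1): 0.25, (2, -1): 0.05}
# Hypothetical stochastic rule Q(z | x) with a binary message z in {0, 1}.
Q = {0: [0.9, 0.1], 1: [0.5, 0.5], 2: [0.2, 0.8]}


def posterior(z):
    """Exact P(Y = 1 | z), using the independence of Z and Y given X."""
    num = sum(Q[x][z] * p for (x, y), p in P_XY.items() if y == 1)
    den = sum(Q[x][z] * p for (x, y), p in P_XY.items())
    return num / den


def kappa(z, samples):
    """Empirical ratio sum_i Q(z|x_i) I(y_i = 1) / sum_i Q(z|x_i)."""
    num = sum(Q[x][z] for x, y in samples if y == 1)
    den = sum(Q[x][z] for x, y in samples)
    return num / den


def draw(n, seed=0):
    """Draw n i.i.d. pairs (x_i, y_i) from P_XY."""
    rng = random.Random(seed)
    pairs, weights = zip(*P_XY.items())
    return rng.choices(pairs, weights=weights, k=n)


samples = draw(200_000)
for z in (0, 1):
    # Empirical ratio next to the exact posterior; they should be close for large n.
    print(z, round(kappa(z, samples), 3), round(posterior(z), 3))
```

Increasing the sample size passed to `draw` should bring the two printed columns closer together, in line with the strong law of large numbers used in the proof.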


Proof of Proposition 5: Here we complete the proof of Proposition 5. It remains to show that the optimum
$w(Q)$ of the primal problem is related to the optimal $\alpha$ of the dual problem via $w(Q) = \sum_{i=1}^n \alpha_i y_i \Phi_Q(x_i)$.
Indeed, since $G(w)$ is a convex function with respect to $w$, $w(Q)$ is an optimum solution for $\min_w G(w; Q)$
if and only if $0 \in \partial_w G(w(Q))$. By definition of the conjugate dual, this condition is equivalent to
$w(Q) \in \partial G^*(0)$.

Recall that $G^*$ is an inf-convolution of $n$ functions $g_1^*, \ldots, g_n^*$ and $\Omega^*$. Let $\alpha := (\alpha_1, \ldots, \alpha_n)$ be an
optimum solution to the dual problem, and $u := (u_1, \ldots, u_n)$ be the corresponding value at which the
infimum operation in the definition of $G^*$ is attained. Applying the subdifferential rule for an
inf-convolution (Cor. 4.5.5, [14]) gives

$$\partial G^*(0) = \partial g_1^*(u_1) \cap \cdots \cap \partial g_n^*(u_n) \cap \partial \Omega^*\Big(-\sum_{i=1}^n u_i\Big).$$

But $\Omega^*(v) = \frac{1}{2\lambda}\|v\|^2$, and so $\partial \Omega^*(-\sum_{i=1}^n u_i)$ reduces to the singleton $-\frac{1}{\lambda}\sum_{i=1}^n u_i = \sum_{i=1}^n \alpha_i y_i \Phi_Q(x_i)$. This
implies that $w(Q) = \sum_{i=1}^n \alpha_i y_i \Phi_Q(x_i)$ is the optimum solution to the primal problem.

To conclude, it will be useful for the proof of Lemma 6 to calculate $\partial g_i^*(u_i)$, and derive several additional
properties relating $w(Q)$ and $\alpha$. The expression for $g_i^*$ in equation (23) shows that it is the image of a
function of the scalar $\alpha_i$ under a linear mapping that sends $\alpha_i$ to a multiple of $y_i \Phi_Q(x_i)$. Consequently, by
Theorem 4.5.1 of Hiriart-Urruty and Lemaréchal [14], we have $\partial g_i^*(u_i) = \{w : \langle w, y_i \Phi_Q(x_i) \rangle \in \partial\phi^*(-\lambda\alpha_i)\}$,
which implies that $b_i := \langle w(Q), y_i \Phi_Q(x_i) \rangle \in \partial\phi^*(-\lambda\alpha_i)$ for each $i = 1, \ldots, n$. By convex duality, this
also implies that $-\lambda\alpha_i \in \partial\phi(b_i)$ for $i = 1, \ldots, n$.
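
The relations $w(Q) = \sum_i \alpha_i y_i \Phi_Q(x_i)$ and $-\lambda\alpha_i \in \partial\phi(b_i)$ can be sanity-checked numerically. The sketch below is an illustration under stated assumptions, not the implementation used in this work: it takes the primal objective to be $\sum_i \phi(y_i \langle w, \Phi_Q(x_i) \rangle) + \frac{\lambda}{2}\|w\|^2$ with the logistic loss $\phi(t) = \log(1 + e^{-t})$, replaces $\Phi_Q(x_i)$ by hypothetical explicit feature vectors `feats`, and solves the primal by plain gradient descent rather than through the dual.

```python
import math

# Hypothetical explicit feature vectors standing in for Phi_Q(x_i), with labels y_i.
feats = [(1.0, 0.5), (0.2, -1.0), (-0.7, 0.3), (1.5, 1.0), (-1.0, -0.4), (0.3, 0.9)]
labels = [1, -1, -1, 1, -1, 1]
lam = 1.0  # regularization parameter lambda (assumed value)


def dphi(t):
    """Derivative of the logistic loss phi(t) = log(1 + exp(-t))."""
    return -1.0 / (1.0 + math.exp(t))


def margins(w):
    """b_i = <w, y_i * Phi(x_i)> for every sample."""
    return [y * (w[0] * f[0] + w[1] * f[1]) for f, y in zip(feats, labels)]


def solve_primal(steps=5000, lr=0.1):
    """Gradient descent on sum_i phi(b_i) + (lam / 2) * ||w||^2."""
    w = [0.0, 0.0]
    for _ in range(steps):
        b = margins(w)
        grad = [lam * w[k] + sum(dphi(bi) * y * f[k]
                                 for bi, f, y in zip(b, feats, labels))
                for k in range(2)]
        w = [w[k] - lr * grad[k] for k in range(2)]
    return w


w_opt = solve_primal()
# Optimality gives -lam * alpha_i = phi'(b_i), i.e. alpha_i = -phi'(b_i) / lam.
alpha = [-dphi(bi) / lam for bi in margins(w_opt)]
w_from_alpha = [sum(a * y * f[k] for a, f, y in zip(alpha, feats, labels))
                for k in range(2)]
print(w_opt, w_from_alpha)
```

Because the logistic loss is differentiable, the subdifferential $\partial\phi(b_i)$ is the single value $\phi'(b_i)$, so the two printed vectors should agree up to numerical precision.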


Proof of Lemma 6: We shall show that the subdifferential $\partial_{Q^t(\bar z^t \mid \bar x^t)} G$ can be computed directly in terms of
the optimal solution $\alpha$ of the dual optimization problem (21) and the kernel function $K_z$. Our approach is
to first derive a formula for $\partial_{Q(z \mid x)} G$, and then to compute $\partial_{Q^t(\bar z^t \mid \bar x^t)} G$ by applying the chain rule.

Define $b_i := \langle w(Q), y_i \Phi_Q(x_i) \rangle$. Using Theorem 23.8 of Rockafellar [24], the subdifferential $\partial_{Q(z \mid x)} G$
evaluated at $(w(Q); Q)$ can be expressed as

$$\partial_{Q(z \mid x)} G = \sum_{i=1}^n \partial_{Q(z \mid x)}\, g_i = \sum_{i=1}^n \partial\phi(b_i)\, y_i\, \langle w, \Phi(z) \rangle\, \mathbb{I}[x_i = x].$$

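Earlier we proved that $-\lambda\alpha_i \in \partial\phi(b_i)$ for each $i = 1, \ldots, n$, where $\alpha$ is the optimal solution of (21).
Therefore, $\partial_{Q(z \mid x)} G$ evaluated at $(w(Q); Q)$ contains the following element:

$$\begin{aligned}
\sum_{i=1}^n -\lambda\alpha_i y_i\, \langle w(Q), \Phi(z) \rangle\, \mathbb{I}[x_i = x]
&= \sum_{i=1}^n -\lambda\alpha_i y_i\, \Big\langle \sum_{j=1}^n \alpha_j y_j \Phi_Q(x_j),\, \Phi(z) \Big\rangle\, \mathbb{I}[x_i = x] \\
&= \sum_{i,j} -\lambda\alpha_i\alpha_j y_i y_j\, \mathbb{I}[x_i = x] \sum_{z'} K_z(z', z)\, Q(z' \mid x_j).
\end{aligned}$$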

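For each $t = 1, \ldots, S$, $\partial_{Q^t(\bar z^t \mid \bar x^t)} G$ is related to $\partial_{Q(z \mid x)} G$ by the chain rule. Note that $Q(z \mid x) = \prod_{t=1}^S Q^t(z^t \mid x^t)$. Hence

$$\begin{aligned}
\partial_{Q^t(\bar z^t \mid \bar x^t)} G
&= \sum_{z,x} \partial_{Q(z \mid x)} G\; \frac{\partial Q(z \mid x)}{\partial Q^t(\bar z^t \mid \bar x^t)} \\
&= \sum_{z,x} \partial_{Q(z \mid x)} G\; \frac{Q(z \mid x)}{Q^t(\bar z^t \mid \bar x^t)}\, \mathbb{I}[x^t = \bar x^t]\, \mathbb{I}[z^t = \bar z^t],
\end{aligned}$$

which contains the following element as one of its subgradients:

$$\begin{aligned}
&\sum_{z,x} \frac{Q(z \mid x)}{Q^t(\bar z^t \mid \bar x^t)}\, \mathbb{I}[x^t = \bar x^t]\, \mathbb{I}[z^t = \bar z^t] \sum_{i,j} -\lambda\alpha_i\alpha_j y_i y_j\, \mathbb{I}[x_i = x] \sum_{z'} K_z(z', z)\, Q(z' \mid x_j) \\
&\qquad = \sum_{i,j,z,z'} -\lambda\alpha_i\alpha_j y_i y_j\, \mathbb{I}[x_i^t = \bar x^t]\, \mathbb{I}[z^t = \bar z^t]\, Q(z' \mid x_j)\, \frac{Q(z \mid x_i)}{Q^t(\bar z^t \mid x_i^t)}\, K_z(z', z).
\end{aligned}$$

This completes the proof of the lemma.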
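
The chain-rule step rests on the fact that, for a product-form rule, the partial derivative of $Q(z \mid x)$ with respect to one table entry $Q^t(\bar z^t \mid \bar x^t)$ equals $Q(z \mid x) / Q^t(\bar z^t \mid \bar x^t)$ when $(z^t, x^t) = (\bar z^t, \bar x^t)$ and zero otherwise. The following minimal sketch checks this identity by finite differences; it assumes a hypothetical two-sensor model with binary observations and messages, and the table `Qt` and the helper names are illustrative only.

```python
import copy

# Hypothetical per-sensor rules: Qt[t][x_t][z_t] = Q^t(z^t | x^t), for S = 2 sensors.
Qt = [
    {0: [0.7, 0.3], 1: [0.4, 0.6]},
    {0: [0.2, 0.8], 1: [0.9, 0.1]},
]


def joint(rules, z, x):
    """Q(z | x) = prod_t Q^t(z^t | x^t)."""
    prob = 1.0
    for t, rule in enumerate(rules):
        prob *= rule[x[t]][z[t]]
    return prob


def analytic_partial(rules, z, x, t, z_bar, x_bar):
    """Closed form: Q(z|x) / Q^t(z_bar|x_bar) if (z^t, x^t) = (z_bar, x_bar), else 0."""
    if z[t] != z_bar or x[t] != x_bar:
        return 0.0
    return joint(rules, z, x) / rules[t][x_bar][z_bar]


def numeric_partial(rules, z, x, t, z_bar, x_bar, h=1e-6):
    """Finite-difference derivative, treating the table entries as free variables."""
    bumped = copy.deepcopy(rules)
    bumped[t][x_bar][z_bar] += h
    return (joint(bumped, z, x) - joint(rules, z, x)) / h


z, x = (1, 0), (0, 1)
print(analytic_partial(Qt, z, x, 0, 1, 0), numeric_partial(Qt, z, x, 0, 1, 0))
```

The finite-difference version deliberately ignores the normalization constraint on each conditional distribution, matching the way the entries are differentiated in the proof.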


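Proof of Proposition 9: By definition of Rademacher complexity [30], we have

$$\begin{aligned}
R_n(\mathcal{F}_0) &= \mathbb{E} \sup_{f \in \mathcal{F}_0} \frac{2}{n} \sum_{i=1}^n \sigma_i f(X_i) \\
&= \mathbb{E} \sup_{\|w\| \le B;\ Q \in \mathcal{Q}_0} \frac{2}{n} \sum_{i=1}^n \sigma_i\, \langle w, \Phi_Q(X_i) \rangle \\
&= \frac{2B}{n}\, \mathbb{E} \sup_{Q \in \mathcal{Q}_0} \Big\| \sum_{i=1}^n \sigma_i \Phi_Q(X_i) \Big\|.
\end{aligned}$$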


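Applying the Cauchy-Schwarz inequality yields

$$\begin{aligned}
R_n(\mathcal{F}_0) &\le \frac{2B}{n} \Big[ \mathbb{E} \sup_{Q \in \mathcal{Q}_0} \Big\| \sum_{i=1}^n \sigma_i \Phi_Q(X_i) \Big\|^2 \Big]^{1/2} \\
&\le \frac{2B}{n} \Big[ \mathbb{E} \sup_{Q \in \mathcal{Q}_0} \sum_{i=1}^n K_Q(X_i, X_i) + 2\, \mathbb{E} \sup_{Q \in \mathcal{Q}_0} \sum_{1 \le i < j \le n} \sigma_i\sigma_j K_Q(X_i, X_j) \Big]^{1/2}.
\end{aligned}$$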


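It remains to upper bound the second term inside the square root on the right-hand side. The trick is to partition the
$n(n-1)/2$ pairs $(i, j)$ into $n - 1$ subsets, each of which has $n/2$ pairs with pairwise distinct indices (assuming $n$ is
even for simplicity). The existence of such a partition can be shown by induction on $n$. Now, for each $i =
1, \ldots, n-1$, denote the subset indexed by $i$ by the $n/2$ pairs $(\pi_i(j), \pi_i'(j))_{j=1}^{n/2}$, where
$\{\pi_i(1), \ldots, \pi_i(n/2)\} \cap \{\pi_i'(1), \ldots, \pi_i'(n/2)\} = \emptyset$. Therefore,

$$\begin{aligned}
\mathbb{E} \sup_{Q \in \mathcal{Q}_0} \sum_{1 \le i < j \le n} \sigma_i\sigma_j K_Q(X_i, X_j)
&= \mathbb{E} \sup_{Q \in \mathcal{Q}_0} \sum_{i=1}^{n-1} \sum_{j=1}^{n/2} \sigma_{\pi_i(j)}\sigma_{\pi_i'(j)}\, K_Q(X_{\pi_i(j)}, X_{\pi_i'(j)}) \\
&\le \sum_{i=1}^{n-1} \mathbb{E} \sup_{Q \in \mathcal{Q}_0} \sum_{j=1}^{n/2} \sigma_{\pi_i(j)}\sigma_{\pi_i'(j)}\, K_Q(X_{\pi_i(j)}, X_{\pi_i'(j)}).
\end{aligned}$$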
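
The partition used above is a decomposition of the $n(n-1)/2$ pairs into $n - 1$ perfect matchings. The text establishes its existence by induction; as an alternative illustration, the sketch below builds one explicitly with the standard round-robin ("circle") scheduling construction, which is a swapped-in technique and not taken from the source. The function name `round_robin_partition` is hypothetical, and indices run from 0 rather than 1.

```python
def round_robin_partition(n):
    """Split all pairs {i, j} of 0..n-1 (n even) into n-1 rounds of n/2 disjoint pairs."""
    if n % 2 != 0:
        raise ValueError("n must be even")
    players = list(range(n))
    rounds = []
    for _ in range(n - 1):
        # Pair the k-th element with the k-th element from the end.
        rounds.append([tuple(sorted((players[k], players[n - 1 - k])))
                       for k in range(n // 2)])
        # Keep players[0] fixed and rotate the remaining n-1 positions by one.
        players = [players[0]] + [players[-1]] + players[1:-1]
    return rounds


rounds = round_robin_partition(6)
for r in rounds:
    print(r)
```

Within each round every index appears exactly once, which is the property that lets the products $\sigma_{\pi_i(j)}\sigma_{\pi_i'(j)}$ be treated as independent Rademacher variables in the next step.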


Our final step is to bound the terms inside the summation over $i$ by invoking Massart's lemma [22] for
bounding Rademacher averages over a finite set $A \subset \mathbb{R}^d$:

$$\mathbb{E} \sup_{a \in A} \sum_{i=1}^d \sigma_i a_i \le \max_{a \in A} \|a\|_2 \sqrt{2 \log |A|}. \qquad (32)$$
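
Inequality (32) can be checked directly on a small example by enumerating all sign vectors. The sketch below assumes a hypothetical finite set `A` of vectors in $\mathbb{R}^3$ and computes the Rademacher average exactly rather than by sampling; the names and values are illustrative only.

```python
import itertools
import math

# Hypothetical finite set A of vectors in R^3 (values chosen only for illustration).
A = [(1.0, 0.0, -0.5), (0.3, 0.8, 0.2), (-0.6, 0.4, 0.9), (0.0, -1.0, 0.1)]


def rademacher_average(vectors):
    """Exact E sup_{a in A} sum_i sigma_i a_i over uniform signs sigma in {-1,+1}^d."""
    d = len(vectors[0])
    total = 0.0
    for signs in itertools.product((-1, 1), repeat=d):
        total += max(sum(s * a_i for s, a_i in zip(signs, a)) for a in vectors)
    return total / 2 ** d


def massart_bound(vectors):
    """Right-hand side of (32): max ||a||_2 * sqrt(2 log |A|)."""
    radius = max(math.sqrt(sum(a_i * a_i for a_i in a)) for a in vectors)
    return radius * math.sqrt(2.0 * math.log(len(vectors)))


# The first printed value should not exceed the second.
print(rademacher_average(A), massart_bound(A))
```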


Now, for each $i$ and a realization of $X_1, \ldots, X_n$, treat $\sigma_{\pi_i(j)}\sigma_{\pi_i'(j)}$ for $j = 1, \ldots, n/2$ as $n/2$ Rademacher
variables, and note that the $n/2$-dimensional vector $(K_Q(X_{\pi_i(j)}, X_{\pi_i'(j)}))_{j=1}^{n/2}$ takes on only $L^{MS}$ possible values
(since there are $L^{MS}$ possible choices for $Q \in \mathcal{Q}_0$). Then we have, for each $i = 1, \ldots, n-1$:

$$\mathbb{E} \sup_{Q \in \mathcal{Q}_0} \sum_{j=1}^{n/2} \sigma_{\pi_i(j)}\sigma_{\pi_i'(j)}\, K_Q(X_{\pi_i(j)}, X_{\pi_i'(j)}) \le \sqrt{n/2}\, \sup_{z,z'} K_z(z, z') \sqrt{2 \log(L^{MS})},$$

from which the proposition follows.


Proof of Proposition 10: We treat each $Q(Z \mid X) \in \mathcal{Q}$ as a function over all possible values $(z, x)$. Recall
that $X$ is an $S$-dimensional vector $X = (X^1, \ldots, X^S)$. For each fixed realization $x^t$ of $X^t$, for $t = 1, \ldots, S$,
the set of all discrete conditional probability distributions $Q(Z^t \mid x^t)$ is an $(L-1)$-simplex $\Delta_L$. Since each
$X^t$ takes on $M$ possible values, and $X$ has $S$ dimensions, we have:

$$N(\epsilon, \mathcal{Q}, L_\infty) \le N(\epsilon, \Delta_L, \ell_\infty)^{MS} \le (1/\epsilon)^{(L-1)MS}.$$

Recall that each $f \in \mathcal{F}$ can be written as:

$$f(x) = \sum_{i=1}^n \alpha_i \sum_{z, z_i} Q(z \mid x)\, Q(z_i \mid x_i)\, K_z(z, z_i). \qquad (33)$$

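We now define $\epsilon_0 := \epsilon\, [2 L^S \sup \|\alpha\|_1 \sup_{z,z'} K_z(z, z')]^{-1}$. Given each fixed conditional distribution $Q$ in
the $\epsilon_0$-covering $G(\epsilon_0, \mathcal{Q}, L_\infty)$ for $\mathcal{Q}$, we can construct an $\epsilon/2$-covering in $L_2(P_n)$ for $\mathcal{F}_Q$. It is straightforward
to verify that the union of all coverings for $\mathcal{F}_Q$ indexed by $Q \in G(\epsilon_0, \mathcal{Q}, L_\infty)$ forms an $\epsilon$-covering
for $\mathcal{F}$. Indeed, given any function $f \in \mathcal{F}$ that is expressed in the form (33) with a corresponding $Q \in \mathcal{Q}$,
there exists some $Q^* \in G(\epsilon_0, \mathcal{Q}, L_\infty)$ such that $\|Q - Q^*\|_\infty \le \epsilon_0$. Let $f_1$ be a function in $\mathcal{F}_{Q^*}$ using the
same coefficients $\alpha$ as those of $f$. Given $Q^*$ there exists some $f_2 \in \mathcal{F}_{Q^*}$ such that $\|f_1 - f_2\|_{L_2(P_n)} \le \epsilon/2$.
Applying the triangle inequality yields

$$\begin{aligned}
\|f - f_2\|_{L_2(P_n)} &\le \|f - f_1\|_{L_2(P_n)} + \|f_1 - f_2\|_{L_2(P_n)} \\
&\le \|f - f_1\|_\infty + \epsilon/2 \\
&\le L^S \sup \|\alpha\|_1 \sup_{z,z'} K_z(z, z')\, \|Q - Q^*\|_\infty + \epsilon/2,
\end{aligned}$$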

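which is bounded above by $\epsilon$. In summary, we have constructed an $\epsilon$-covering in $L_2(P_n)$ for $\mathcal{F}$ whose
number of elements is no more than $N(\epsilon_0, \mathcal{Q}, L_\infty) \sup_Q N(\epsilon/2, \mathcal{F}_Q, L_2(P_n))$. This implies that

$$\begin{aligned}
\log N(\epsilon, \mathcal{F}, L_2(P_n)) &\le \log \Big\{ N(\epsilon_0, \mathcal{Q}, L_\infty) \sup_Q N(\epsilon/2, \mathcal{F}_Q, L_2(P_n)) \Big\} \\
&\le \log \Big\{ \sup_Q N(\epsilon/2, \mathcal{F}_Q, L_2(P_n)) \Big[ \frac{2 L^S \sup \|\alpha\|_1 \sup_{z,z'} K_z(z, z')}{\epsilon} \Big]^{(L-1)MS} \Big\} \\
&= \sup_{Q \in \mathcal{Q}} \log N(\epsilon/2, \mathcal{F}_Q, L_2(P_n)) + (L-1) MS \log \frac{2 L^S \sup \|\alpha\|_1 \sup_{z,z'} K_z(z, z')}{\epsilon},
\end{aligned}$$

which completes the proof.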
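
To see how the second term of this bound scales with the problem size, the following sketch evaluates $\epsilon_0$ and $(L-1)MS \log(1/\epsilon_0)$ for given parameters. All numerical values, and the names `alpha_l1_sup` and `kernel_sup` standing for $\sup \|\alpha\|_1$ and $\sup_{z,z'} K_z(z, z')$, are assumptions made for illustration.

```python
import math


def epsilon_0(eps, L, S, alpha_l1_sup, kernel_sup):
    """eps_0 = eps / (2 * L^S * sup ||alpha||_1 * sup K_z)."""
    return eps / (2.0 * L ** S * alpha_l1_sup * kernel_sup)


def rule_entropy_term(eps, L, M, S, alpha_l1_sup, kernel_sup):
    """(L - 1) * M * S * log(1 / eps_0): the cost of covering the decision rules Q."""
    eps0 = epsilon_0(eps, L, S, alpha_l1_sup, kernel_sup)
    return (L - 1) * M * S * math.log(1.0 / eps0)


# Illustrative values: binary messages (L = 2), M = 8 observation levels, S = 10 sensors.
for eps in (0.1, 0.01):
    print(eps, rule_entropy_term(eps, L=2, M=8, S=10, alpha_l1_sup=1.0, kernel_sup=1.0))
```

The term grows linearly in $(L-1)M$, roughly quadratically in $S$ (through the prefactor $MS$ and the $S \log L$ inside the logarithm), and only logarithmically in $1/\epsilon$.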